Training
The `training` section of the configuration lets you specify parameters of the training process, such as the number of epochs or the learning rate.
These are the available training parameters (an example `training` section follows the list):
- `batch_size` (default `128`): size of the batch used for training the model.
- `eval_batch_size` (default `0`): size of the batch used for evaluating the model. If it is `0`, the same value as `batch_size` is used. This is useful to speed up evaluation with a much bigger batch size than training, if enough memory is available, or to decrease the batch size when `sampled_softmax_cross_entropy` is used as loss for sequential and categorical features with big vocabulary sizes (evaluation needs to be performed on the full vocabulary, so a much smaller batch size may be needed to fit the activation tensors in memory).
- `epochs` (default `100`): number of epochs the training process will run for.
- `early_stop` (default `5`): if there's a validation set, number of epochs of patience without an improvement on the validation metric before the training is stopped.
- `optimizer` (default `{type: adam, beta_1: 0.9, beta_2: 0.999, epsilon: 1e-08}`): which optimizer to use, together with its parameters. The available optimizers are: `sgd` (or `stochastic_gradient_descent`, `gd`, `gradient_descent`, they are all the same), `adam`, `adadelta`, `adagrad`, `adamax`, `ftrl`, `nadam`, `rmsprop`. For their parameters, check TensorFlow's optimizer documentation.
- `learning_rate` (default `0.001`): the learning rate to use.
- `decay` (default `false`): whether to use exponential decay of the learning rate or not.
- `decay_rate` (default `0.96`): the rate of the exponential learning rate decay.
- `decay_steps` (default `10000`): the number of steps of the exponential learning rate decay.
- `staircase` (default `false`): decays the learning rate at discrete intervals.
- `regularization_lambda` (default `0`): the lambda parameter used for adding an l2 regularization loss to the overall loss.
- `reduce_learning_rate_on_plateau` (default `0`): if there's a validation set, how many times to reduce the learning rate when a plateau of the validation metric is reached.
- `reduce_learning_rate_on_plateau_patience` (default `5`): if there's a validation set, number of epochs of patience without an improvement on the validation metric before reducing the learning rate.
- `reduce_learning_rate_on_plateau_rate` (default `0.5`): if there's a validation set, the reduction rate of the learning rate.
- `increase_batch_size_on_plateau` (default `0`): if there's a validation set, how many times to increase the batch size when a plateau of the validation metric is reached.
- `increase_batch_size_on_plateau_patience` (default `5`): if there's a validation set, number of epochs of patience without an improvement on the validation metric before increasing the batch size.
- `increase_batch_size_on_plateau_rate` (default `2`): if there's a validation set, the increase rate of the batch size.
- `increase_batch_size_on_plateau_max` (default `512`): if there's a validation set, the maximum value of the batch size.
- `validation_field` (default `combined`): when there is more than one output feature, which one to use for computing if there was an improvement on validation. The metric used to determine if there was an improvement can be set with the `validation_metric` parameter. Different datatypes have different available metrics; refer to the datatype-specific sections for more details. `combined` indicates the use of the combination of all features. For instance, the combination of `combined` and `loss` as metric uses a decrease in the combined loss of all output features to check for improvement on validation, while `combined` and `accuracy` considers how many datapoints have correct predictions for all output features (but consider that for some features, for instance `numerical` ones, there is no accuracy metric, so you should use `accuracy` only if all your output features have an accuracy metric).
- `validation_metric` (default `loss`): the metric to use to determine if there was an improvement. The metric is considered for the output feature specified in `validation_field`. Different datatypes have different available metrics; refer to the datatype-specific sections for more details.
- `bucketing_field` (default `null`): when not `null`, instead of shuffling randomly when creating batches, datapoints are bucketed by the length along the last dimension of the matrix of the specified input feature, and randomly shuffled datapoints from the same bin are sampled. Padding is trimmed to the longest datapoint in the batch. The specified feature should be either a `sequence` or `text` feature and the encoder encoding it has to be `rnn`. When used, bucketing improves the speed of `rnn` encoding by up to 1.5x, depending on the length distribution of the inputs.
- `learning_rate_warmup_epochs` (default `1`): the number of training epochs during which learning rate warmup is used. It is calculated as described in [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677). In the paper the authors suggest `6` epochs of warmup; that suggestion applies to large datasets and big batch sizes.
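For reference, here is a sketch of a `training` section that overrides a few of these defaults in the model definition YAML. The parameter names are the ones documented in the list above; the specific values are purely illustrative, not recommendations:

```yaml
training:
  batch_size: 128
  eval_batch_size: 512        # bigger evaluation batches, assuming enough memory
  epochs: 100
  early_stop: 5
  learning_rate: 0.001
  decay: true                 # enable exponential learning rate decay
  decay_rate: 0.96
  decay_steps: 10000
  staircase: false
  regularization_lambda: 0.0001
  optimizer:
    type: adam
    beta_1: 0.9
    beta_2: 0.999
    epsilon: 1.0e-08
  validation_field: combined
  validation_metric: loss
```

When `decay` is `true`, the schedule behaves like TensorFlow's exponential decay: roughly `learning_rate * decay_rate ^ (step / decay_steps)`, with the exponent truncated to an integer when `staircase` is `true`.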
Optimizer details
The available optimizers wrap the ones available in TensorFlow. For details about their parameters please refer to the TensorFlow documentation.
The `learning_rate` parameter the optimizer will use comes from the `training` section.
Other optimizer-specific parameters, shown with their Ludwig default settings, follow:
`sgd` (or `stochastic_gradient_descent`, `gd`, `gradient_descent`)

```
'momentum': 0.0,
'nesterov': false
```

`adam`

```
'beta_1': 0.9,
'beta_2': 0.999,
'epsilon': 1e-08
```

`adadelta`

```
'rho': 0.95,
'epsilon': 1e-08
```

`adagrad`

```
'initial_accumulator_value': 0.1,
'epsilon': 1e-07
```

`adamax`

```
'beta_1': 0.9,
'beta_2': 0.999,
'epsilon': 1e-07
```

`ftrl`

```
'learning_rate_power': -0.5,
'initial_accumulator_value': 0.1,
'l1_regularization_strength': 0.0,
'l2_regularization_strength': 0.0
```

`nadam`

```
'beta_1': 0.9,
'beta_2': 0.999,
'epsilon': 1e-07
```

`rmsprop`

```
'decay': 0.9,
'momentum': 0.0,
'epsilon': 1e-10,
'centered': false
```
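As a closing sketch, choosing one of these optimizers and overriding its defaults is done through the `optimizer` dictionary of the `training` section; the values below are simply the `rmsprop` defaults listed above, spelled out explicitly:

```yaml
training:
  learning_rate: 0.001
  optimizer:
    type: rmsprop
    decay: 0.9
    momentum: 0.0
    epsilon: 1.0e-10
    centered: false
```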