Training
The `training` section of the configuration lets you specify parameters of the training process, such as the number of epochs or the learning rate.
These are the available training parameters (an example `training` section follows the list):
- `batch_size` (default `128`): size of the batch used for training the model.
- `eval_batch_size` (default `0`): size of the batch used for evaluating the model. If it is `0`, the same value as `batch_size` is used. This is useful to speed up evaluation with a much bigger batch size than training, if enough memory is available, or to decrease the batch size when `sampled_softmax_cross_entropy` is used as loss for sequential and categorical features with big vocabulary sizes (evaluation needs to be performed on the full vocabulary, so a much smaller batch size may be needed to fit the activation tensors in memory).
- `epochs` (default `100`): number of epochs the training process will run for.
- `early_stop` (default `5`): if there's a validation set, number of epochs of patience without an improvement on the validation metric before the training is stopped.
- `optimizer` (default `{type: adam, beta_1: 0.9, beta_2: 0.999, epsilon: 1e-08}`): which optimizer to use, together with its parameters. The available optimizers are: `sgd` (or `stochastic_gradient_descent`, `gd`, `gradient_descent`, they are all the same), `adam`, `adadelta`, `adagrad`, `adamax`, `ftrl`, `nadam`, `rmsprop`. For their parameters, check TensorFlow's optimizer documentation.
- `learning_rate` (default `0.001`): the learning rate to use.
- `decay` (default `false`): whether to use exponential decay of the learning rate or not.
- `decay_rate` (default `0.96`): the rate of the exponential learning rate decay.
- `decay_steps` (default `10000`): the number of steps of the exponential learning rate decay.
- `staircase` (default `false`): decays the learning rate at discrete intervals.
- `regularization_lambda` (default `0`): the lambda parameter used for adding an l2 regularization loss to the overall loss.
- `reduce_learning_rate_on_plateau` (default `0`): if there's a validation set, how many times to reduce the learning rate when a plateau of the validation metric is reached.
- `reduce_learning_rate_on_plateau_patience` (default `5`): if there's a validation set, number of epochs of patience without an improvement on the validation metric before reducing the learning rate.
- `reduce_learning_rate_on_plateau_rate` (default `0.5`): if there's a validation set, the reduction rate of the learning rate.
- `increase_batch_size_on_plateau` (default `0`): if there's a validation set, how many times to increase the batch size when a plateau of the validation metric is reached.
- `increase_batch_size_on_plateau_patience` (default `5`): if there's a validation set, number of epochs of patience without an improvement on the validation metric before increasing the batch size.
- `increase_batch_size_on_plateau_rate` (default `2`): if there's a validation set, the increase rate of the batch size.
- `increase_batch_size_on_plateau_max` (default `512`): if there's a validation set, the maximum value of the batch size.
- `validation_field` (default `combined`): when there is more than one output feature, which one to use for computing if there was an improvement on validation. The metric used to determine if there was an improvement can be set with the `validation_metric` parameter. Different datatypes have different available metrics; refer to the datatype-specific sections for more details. `combined` indicates the use of the combination of all features. For instance, the combination of `combined` and `loss` as metric uses a decrease in the combined loss of all output features to check for improvement on validation, while `combined` and `accuracy` considers how many datapoints have correct predictions for all output features (but consider that for some features, for instance `numerical` ones, there is no accuracy metric, so you should use `accuracy` only if all your output features have an accuracy metric).
- `validation_metric` (default `loss`): the metric to use to determine if there was an improvement. The metric is considered for the output feature specified in `validation_field`. Different datatypes have different available metrics; refer to the datatype-specific sections for more details.
- `bucketing_field` (default `null`): when not `null`, instead of shuffling randomly when creating batches, datapoints are bucketed by the length along the last dimension of the matrix of the specified input feature, and randomly shuffled datapoints from the same bin are sampled. Padding is trimmed to the longest datapoint in the batch. The specified feature should be either a `sequence` or `text` feature and the encoder encoding it has to be `rnn`. When used, bucketing improves the speed of `rnn` encoding by up to 1.5x, depending on the length distribution of the inputs.
- `learning_rate_warmup_epochs` (default `1`): the number of training epochs during which learning rate warmup is used. It is calculated as described in [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677). In the paper the authors suggest `6` epochs of warmup; that suggestion applies to large datasets and big batch sizes.
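For reference, here is a sketch of a `training` section that overrides a few of these defaults in the model definition YAML. The parameter names are the ones documented in the list above; the specific values are purely illustrative, not recommendations:

```yaml
training:
  batch_size: 128
  eval_batch_size: 512        # bigger evaluation batches, assuming enough memory
  epochs: 100
  early_stop: 5
  learning_rate: 0.001
  decay: true                 # enable exponential learning rate decay
  decay_rate: 0.96
  decay_steps: 10000
  staircase: false
  regularization_lambda: 0.0001
  optimizer:
    type: adam
    beta_1: 0.9
    beta_2: 0.999
    epsilon: 1.0e-08
  validation_field: combined
  validation_metric: loss
```

When `decay` is `true`, the schedule behaves like TensorFlow's exponential decay: roughly `learning_rate * decay_rate ^ (step / decay_steps)`, with the exponent truncated to an integer when `staircase` is `true`.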
Optimizer details
The available optimizers wrap the ones available in TensorFlow. For details about their parameters please refer to the TensorFlow documentation.
The `learning_rate` parameter the optimizer will use comes from the `training` section.
Other optimizer-specific parameters, shown with their Ludwig default settings, follow:
`sgd` (or `stochastic_gradient_descent`, `gd`, `gradient_descent`)

```
'momentum': 0.0,
'nesterov': false
```

`adam`

```
'beta_1': 0.9,
'beta_2': 0.999,
'epsilon': 1e-08
```

`adadelta`

```
'rho': 0.95,
'epsilon': 1e-08
```

`adagrad`

```
'initial_accumulator_value': 0.1,
'epsilon': 1e-07
```

`adamax`

```
'beta_1': 0.9,
'beta_2': 0.999,
'epsilon': 1e-07
```

`ftrl`

```
'learning_rate_power': -0.5,
'initial_accumulator_value': 0.1,
'l1_regularization_strength': 0.0,
'l2_regularization_strength': 0.0
```

`nadam`

```
'beta_1': 0.9,
'beta_2': 0.999,
'epsilon': 1e-07
```

`rmsprop`

```
'decay': 0.9,
'momentum': 0.0,
'epsilon': 1e-10,
'centered': false
```
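As a closing sketch, choosing one of these optimizers and overriding its defaults is done through the `optimizer` dictionary of the `training` section; the values below are simply the `rmsprop` defaults listed above, spelled out explicitly:

```yaml
training:
  learning_rate: 0.001
  optimizer:
    type: rmsprop
    decay: 0.9
    momentum: 0.0
    epsilon: 1.0e-10
    centered: false
```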