Trainer
Overview¶
The `trainer` section of the configuration lets you specify parameters that
configure the training process, like the number of epochs or the learning rate.
```yaml
trainer:
    epochs: 100
    train_steps: None
    early_stop: 5
    batch_size: 128
    eval_batch_size: null
    evaluate_training_set: True
    checkpoints_per_epoch: 0
    steps_per_checkpoint: 0
    regularization_lambda: 0
    regularization_type: l2
    learning_rate: 0.001
    reduce_learning_rate_on_plateau: 0
    reduce_learning_rate_on_plateau_patience: 5
    reduce_learning_rate_on_plateau_rate: 0.5
    increase_batch_size_on_plateau: 0
    increase_batch_size_on_plateau_patience: 5
    increase_batch_size_on_plateau_rate: 2
    increase_batch_size_on_plateau_max: 512
    decay: false
    decay_steps: 10000
    decay_rate: 0.96
    staircase: false
    validation_field: combined
    validation_metric: loss
    bucketing_field: null
    learning_rate_warmup_epochs: 1
    optimizer:
        type: adam
        beta1: 0.9
        beta2: 0.999
        epsilon: 1e-08
        clip_global_norm: 0.5
        clipnorm: null
        clip_value: null
```
Trainer parameters¶
- `epochs` (default `100`): number of epochs the training process will run for.
- `train_steps` (default `None`): maximum number of training steps the training process will run for. If unset, `epochs` is used to determine training length.
- `early_stop` (default `5`): number of consecutive rounds of evaluation without any improvement on the `validation_metric` that triggers training to stop. Can be set to `-1`, which disables early stopping entirely.
- `batch_size` (default `128`): size of the batch used for training the model.
- `eval_batch_size` (default `null`): size of the batch used for evaluating the model. If it is `0` or `null`, the same value as `batch_size` is used. This is useful to speed up evaluation with a much bigger batch size than training, if enough memory is available.
- `evaluate_training_set` (default `True`): whether to include the entire training set during evaluation.
- `checkpoints_per_epoch` (default `0`): number of checkpoints per epoch. For example, `2` means checkpoints are written every half epoch. Note that it is invalid to specify both non-zero `steps_per_checkpoint` and non-zero `checkpoints_per_epoch`.
- `steps_per_checkpoint` (default `0`): how often the model is checkpointed. Also dictates the maximum evaluation frequency. If `0`, the model is checkpointed after every epoch.
- `regularization_lambda` (default `0`): the lambda parameter used for adding regularization loss to the overall loss.
- `regularization_type` (default `l2`): the type of regularization.
- `learning_rate` (default `0.001`): the learning rate to use.
- `reduce_learning_rate_on_plateau` (default `0`): if there is a validation set, how many times to reduce the learning rate when a plateau of the validation metric is reached (see the example after this list).
- `reduce_learning_rate_on_plateau_patience` (default `5`): if there is a validation set, the number of epochs of patience without an improvement on the validation metric before reducing the learning rate.
- `reduce_learning_rate_on_plateau_rate` (default `0.5`): if there is a validation set, the reduction rate of the learning rate.
- `increase_batch_size_on_plateau` (default `0`): if there is a validation set, how many times to increase the batch size when a plateau of the validation metric is reached.
- `increase_batch_size_on_plateau_patience` (default `5`): if there is a validation set, the number of epochs of patience without an improvement on the validation metric before increasing the batch size.
- `increase_batch_size_on_plateau_rate` (default `2`): if there is a validation set, the increase rate of the batch size.
- `increase_batch_size_on_plateau_max` (default `512`): if there is a validation set, the maximum value of the batch size.
- `decay` (default `false`): whether to use exponential decay of the learning rate.
- `decay_rate` (default `0.96`): the rate of the exponential learning rate decay.
- `decay_steps` (default `10000`): the number of steps of the exponential learning rate decay.
- `staircase` (default `false`): decays the learning rate at discrete intervals.
- `validation_field` (default `combined`): when there is more than one output feature, which one to use for determining whether there was an improvement on validation. The metric used to determine improvement is set with the `validation_metric` parameter. Different data types have different metrics; refer to the datatype-specific sections for more details. `combined` indicates the combination of all output features. For instance, `combined` with `loss` as metric uses a decrease in the combined loss of all output features to check for improvement on validation, while `combined` with `accuracy` considers how many examples had correct predictions for all output features (note that some features, for instance `numeric` ones, have no accuracy metric, so `accuracy` should be used only if all output features have an accuracy metric).
- `validation_metric` (default `loss`): the metric used to determine whether there was an improvement. The metric is computed on the output feature specified in `validation_field`. Different data types have different available metrics; refer to the datatype-specific sections for more details.
- `bucketing_field` (default `null`): when not `null`, instead of shuffling randomly when creating batches, examples are bucketed by the length along the last dimension of the matrix of the specified input feature, and randomly shuffled examples from the same bucket are then sampled. Padding is trimmed to the longest example in the batch. The specified feature should be either a `sequence` or `text` feature and the encoder encoding it has to be an `rnn`. When used, bucketing improves the speed of `rnn` encoding by up to 1.5x, depending on the length distribution of the inputs.
- `learning_rate_warmup_epochs` (default `1`): the number of training epochs during which learning rate warmup is used. It is calculated as described in Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. In the paper the authors suggest `6` epochs of warmup; that value is suggested for large datasets and big batch sizes.
- `optimizer` (default `{type: adam, beta1: 0.9, beta2: 0.999, epsilon: 1e-08}`): which optimizer to use, along with its parameters. The available optimizers are: `sgd` (or `stochastic_gradient_descent`, `gd`, `gradient_descent`, which are all the same), `adam`, `adadelta`, `adagrad`, `adamax`, `ftrl`, `nadam`, `rmsprop`. Check the PyTorch optimizer documentation for a full list of parameters for each optimizer. The optimizer definition can also specify gradient clipping using `clip_global_norm`, `clipnorm`, and `clip_value`.
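As an illustration of the plateau-related parameters above, here is a minimal sketch of a `trainer` section that halves the learning rate up to two times when the validation metric stops improving. The specific values are illustrative, not recommendations.

```yaml
trainer:
    learning_rate: 0.001
    # reduce the learning rate at most 2 times during the run
    reduce_learning_rate_on_plateau: 2
    # wait 5 epochs without improvement before each reduction
    reduce_learning_rate_on_plateau_patience: 5
    # multiply the learning rate by 0.5 at each reduction
    reduce_learning_rate_on_plateau_rate: 0.5
    validation_field: combined
    validation_metric: loss
```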
Optimizer parameters¶
The available optimizers wrap the ones available in PyTorch. For details about the parameters that can be used to configure different optimizers, please refer to the PyTorch documentation.
The `learning_rate` parameter the optimizer will use comes from the `trainer` section.
Other optimizer-specific parameters, shown with their Ludwig default settings, follow:
`sgd` (or `stochastic_gradient_descent`, `gd`, `gradient_descent`)

```yaml
momentum: 0.0
nesterov: false
```

`adam`

```yaml
beta_1: 0.9
beta_2: 0.999
epsilon: 1e-08
```

`adadelta`

```yaml
rho: 0.95
epsilon: 1e-08
```

`adagrad`

```yaml
initial_accumulator_value: 0.1
epsilon: 1e-07
```

`adamax`

```yaml
beta_1: 0.9
beta_2: 0.999
epsilon: 1e-07
```

`ftrl`

```yaml
learning_rate_power: -0.5
initial_accumulator_value: 0.1
l1_regularization_strength: 0.0
l2_regularization_strength: 0.0
```

`nadam`

```yaml
beta_1: 0.9
beta_2: 0.999
epsilon: 1e-07
```

`rmsprop`

```yaml
decay: 0.9
momentum: 0.0
epsilon: 1e-10
centered: false
```
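As an example of picking one of these optimizers and overriding one of its defaults, the `type` and parameters are nested under `optimizer` in the `trainer` section. The following is a sketch; the `momentum` value is illustrative, not a recommendation.

```yaml
trainer:
    learning_rate: 0.001  # comes from the trainer section, not the optimizer
    optimizer:
        type: rmsprop
        momentum: 0.9  # overrides the rmsprop default of 0.0
        # parameters left unspecified (decay, epsilon, centered) keep their defaults
```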
Note
Gradient clipping is also configurable, through optimizers, with the following parameters:
```yaml
clip_global_norm: 0.5
clipnorm: null
clip_value: null
```
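For instance, here is a sketch that tightens clipping of the global gradient norm, with the clipping keys given inside the optimizer definition as described above (the value `0.1` is arbitrary and only for illustration):

```yaml
trainer:
    optimizer:
        type: adam
        clip_global_norm: 0.1  # clip the global norm of all gradients to 0.1
```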
Configuring training length¶
The length of the training process is configured by:
- `epochs` (default: 100): One epoch is one pass through the entire dataset. By default, `epochs` is 100, which means that the training process will run for a maximum of 100 epochs before terminating.
- `train_steps` (default: `None`): The maximum number of steps to train for, using one mini-batch per step. By default this is unset, and `epochs` will be used to determine training length.
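For example, to bound training by steps instead of epochs, a minimal sketch (the step count is illustrative):

```yaml
trainer:
    # stop after at most 10000 mini-batch steps, regardless of how many epochs that is
    train_steps: 10000
```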
Tip
In general, it's a good idea to set up a long training runway, relying on
early stopping criteria (`early_stop`) to stop training when there
hasn't been any improvement for a long time.
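A sketch of that pattern, with illustrative values:

```yaml
trainer:
    epochs: 1000    # long runway; early stopping usually triggers first
    early_stop: 10  # stop after 10 consecutive evaluations without improvement
```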
Configuring checkpoint and evaluation frequency¶
Evaluation is run every time the model is checkpointed.
By default, checkpoint-evaluation will occur once every epoch.
The frequency of checkpoint-evaluation can be configured using:
- `steps_per_checkpoint` (default: 0): every `n` training steps
- `checkpoints_per_epoch` (default: 0): `n` times per epoch
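For example, to checkpoint and evaluate twice per epoch, a minimal sketch (remember that only one of the two parameters may be non-zero):

```yaml
trainer:
    checkpoints_per_epoch: 2  # checkpoint and evaluate every half epoch
    steps_per_checkpoint: 0   # must stay 0 when checkpoints_per_epoch is set
```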
Note
It is invalid to specify both non-zero `steps_per_checkpoint` and non-zero `checkpoints_per_epoch`.
Tip
Running evaluation once per epoch is an appropriate fit for small datasets that fit in memory and train quickly. However, this can be a poor fit for unstructured datasets, which tend to be much larger, and train more slowly due to larger models.
Running evaluation too frequently can be wasteful, while running it too infrequently can be uninformative. In large-scale training runs, it's common for evaluation to be configured to run on a sub-epoch time scale, or every few thousand steps.
We recommend configuring evaluation such that new evaluation results are available at least several times an hour. In general, it is not necessary for models to train over the entirety of a dataset, nor evaluate over the entirety of a test set, to produce useful monitoring metrics and signals to indicate model performance.
Increasing throughput¶
Skip evaluation on the training set¶
Consider setting `evaluate_training_set=False`, which skips evaluation on the training set.
Note
Sometimes it can be useful to monitor evaluation metrics on the training set, as a secondary validation set. However, running evaluation on the full training set, when your training set is large, can be a huge computational cost. Turning off training set evaluation will lead to significant gains in training throughput and efficiency.
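In the `trainer` section, this would look like:

```yaml
trainer:
    # skip the (potentially expensive) evaluation pass over the training set
    evaluate_training_set: false
```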
Increase batch size¶
Users training on GPUs can often increase training throughput by increasing
the `batch_size` so that more examples are computed every training step. Set
`batch_size` to `auto` to use the largest batch size that can fit in memory.
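A sketch of the `auto` setting:

```yaml
trainer:
    batch_size: auto  # use the largest batch size that fits in memory
```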