Trainer
Overview¶
The `trainer` section of the configuration lets you specify parameters that
configure the training process, like the number of epochs or the learning rate.
```yaml
trainer:
    epochs: 100
    train_steps: None
    early_stop: 5
    batch_size: 128
    eval_batch_size: null
    evaluate_training_set: True
    checkpoints_per_epoch: 0
    steps_per_checkpoint: 0
    regularization_lambda: 0
    regularization_type: l2
    learning_rate: 0.001
    reduce_learning_rate_on_plateau: 0
    reduce_learning_rate_on_plateau_patience: 5
    reduce_learning_rate_on_plateau_rate: 0.5
    increase_batch_size_on_plateau: 0
    increase_batch_size_on_plateau_patience: 5
    increase_batch_size_on_plateau_rate: 2
    increase_batch_size_on_plateau_max: 512
    decay: false
    decay_steps: 10000
    decay_rate: 0.96
    staircase: false
    validation_field: combined
    validation_metric: loss
    bucketing_field: null
    learning_rate_warmup_epochs: 1
    optimizer:
        type: adam
        beta1: 0.9
        beta2: 0.999
        epsilon: 1e-08
        clip_global_norm: 0.5
        clipnorm: null
        clip_value: null
```
Trainer parameters¶
- `epochs` (default `100`): number of epochs the training process will run for.
- `train_steps` (default `None`): maximum number of training steps the training process will run for. If unset, `epochs` is used to determine training length.
- `early_stop` (default `5`): number of consecutive rounds of evaluation without any improvement on the `validation_metric` that triggers training to stop. Can be set to `-1`, which disables early stopping entirely.
- `batch_size` (default `128`): size of the batch used for training the model.
- `eval_batch_size` (default `null`): size of the batch used for evaluating the model. If it is `0` or `null`, the same value as `batch_size` is used. This is useful to speed up evaluation with a much bigger batch size than training, if enough memory is available.
- `evaluate_training_set` (default `True`): whether to include the entire training set during evaluation.
- `checkpoints_per_epoch` (default `0`): number of checkpoints per epoch. For example, `2` means checkpoints are written every half epoch. Note that it is invalid to specify both non-zero `steps_per_checkpoint` and non-zero `checkpoints_per_epoch`.
- `steps_per_checkpoint` (default `0`): how often the model is checkpointed. Also dictates the maximum evaluation frequency. If `0`, the model is checkpointed after every epoch.
- `regularization_lambda` (default `0`): the lambda parameter used for adding regularization loss to the overall loss.
- `regularization_type` (default `l2`): the type of regularization.
- `learning_rate` (default `0.001`): the learning rate to use.
- `reduce_learning_rate_on_plateau` (default `0`): if there is a validation set, how many times to reduce the learning rate when a plateau of the validation metric is reached (see the example after this list).
- `reduce_learning_rate_on_plateau_patience` (default `5`): if there is a validation set, the number of epochs of patience without an improvement on the validation metric before reducing the learning rate.
- `reduce_learning_rate_on_plateau_rate` (default `0.5`): if there is a validation set, the reduction rate of the learning rate.
- `increase_batch_size_on_plateau` (default `0`): if there is a validation set, how many times to increase the batch size when a plateau of the validation metric is reached.
- `increase_batch_size_on_plateau_patience` (default `5`): if there is a validation set, the number of epochs of patience without an improvement on the validation metric before increasing the batch size.
- `increase_batch_size_on_plateau_rate` (default `2`): if there is a validation set, the increase rate of the batch size.
- `increase_batch_size_on_plateau_max` (default `512`): if there is a validation set, the maximum value of the batch size.
- `decay` (default `false`): whether to use exponential decay of the learning rate.
- `decay_rate` (default `0.96`): the rate of the exponential learning rate decay.
- `decay_steps` (default `10000`): the number of steps of the exponential learning rate decay.
- `staircase` (default `false`): decays the learning rate at discrete intervals.
- `validation_field` (default `combined`): when there is more than one output feature, which one to use for determining whether there was an improvement on validation. The metric used to determine improvement is set with the `validation_metric` parameter. Different data types have different metrics; refer to the datatype-specific sections for more details. `combined` indicates the combination of all output features. For instance, `combined` with `loss` as metric uses a decrease in the combined loss of all output features to check for improvement on validation, while `combined` with `accuracy` considers how many examples had correct predictions for all output features (note that some features, for instance `numeric` ones, have no accuracy metric, so `accuracy` should be used only if all output features have an accuracy metric).
- `validation_metric` (default `loss`): the metric used to determine whether there was an improvement. The metric is computed on the output feature specified in `validation_field`. Different data types have different available metrics; refer to the datatype-specific sections for more details.
- `bucketing_field` (default `null`): when not `null`, instead of shuffling randomly when creating batches, examples are bucketed by the length along the last dimension of the matrix of the specified input feature, and randomly shuffled examples from the same bucket are then sampled. Padding is trimmed to the longest example in the batch. The specified feature should be either a `sequence` or `text` feature and the encoder encoding it has to be an `rnn`. When used, bucketing improves the speed of `rnn` encoding by up to 1.5x, depending on the length distribution of the inputs.
- `learning_rate_warmup_epochs` (default `1`): the number of training epochs during which learning rate warmup is used. It is calculated as described in Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. In the paper the authors suggest `6` epochs of warmup; that value is suggested for large datasets and big batch sizes.
- `optimizer` (default `{type: adam, beta1: 0.9, beta2: 0.999, epsilon: 1e-08}`): which optimizer to use, along with its parameters. The available optimizers are: `sgd` (or `stochastic_gradient_descent`, `gd`, `gradient_descent`, which are all the same), `adam`, `adadelta`, `adagrad`, `adamax`, `ftrl`, `nadam`, `rmsprop`. Check the PyTorch optimizer documentation for a full list of parameters for each optimizer. The optimizer definition can also specify gradient clipping using `clip_global_norm`, `clipnorm`, and `clip_value`.
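As an illustration of the plateau-related parameters above, here is a minimal sketch of a `trainer` section that halves the learning rate up to two times when the validation metric stops improving. The specific values are illustrative, not recommendations.

```yaml
trainer:
    learning_rate: 0.001
    # reduce the learning rate at most 2 times during the run
    reduce_learning_rate_on_plateau: 2
    # wait 5 epochs without improvement before each reduction
    reduce_learning_rate_on_plateau_patience: 5
    # multiply the learning rate by 0.5 at each reduction
    reduce_learning_rate_on_plateau_rate: 0.5
    validation_field: combined
    validation_metric: loss
```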
Optimizer parameters¶
The available optimizers wrap the ones available in PyTorch. For details about the parameters that can be used to configure different optimizers, please refer to the PyTorch documentation.
The `learning_rate` parameter the optimizer will use comes from the `trainer` section.
Other optimizer-specific parameters, shown with their Ludwig default settings, follow:
`sgd` (or `stochastic_gradient_descent`, `gd`, `gradient_descent`)

```yaml
momentum: 0.0
nesterov: false
```

`adam`

```yaml
beta_1: 0.9
beta_2: 0.999
epsilon: 1e-08
```

`adadelta`

```yaml
rho: 0.95
epsilon: 1e-08
```

`adagrad`

```yaml
initial_accumulator_value: 0.1
epsilon: 1e-07
```

`adamax`

```yaml
beta_1: 0.9
beta_2: 0.999
epsilon: 1e-07
```

`ftrl`

```yaml
learning_rate_power: -0.5
initial_accumulator_value: 0.1
l1_regularization_strength: 0.0
l2_regularization_strength: 0.0
```

`nadam`

```yaml
beta_1: 0.9
beta_2: 0.999
epsilon: 1e-07
```

`rmsprop`

```yaml
decay: 0.9
momentum: 0.0
epsilon: 1e-10
centered: false
```
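As an example of picking one of these optimizers and overriding one of its defaults, the `type` and parameters are nested under `optimizer` in the `trainer` section. The following is a sketch; the `momentum` value is illustrative, not a recommendation.

```yaml
trainer:
    learning_rate: 0.001  # comes from the trainer section, not the optimizer
    optimizer:
        type: rmsprop
        momentum: 0.9  # overrides the rmsprop default of 0.0
        # parameters left unspecified (decay, epsilon, centered) keep their defaults
```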
Note
Gradient clipping is also configurable, through optimizers, with the following parameters:
```yaml
clip_global_norm: 0.5
clipnorm: null
clip_value: null
```
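For instance, here is a sketch that tightens clipping of the global gradient norm, with the clipping keys given inside the optimizer definition as described above (the value `0.1` is arbitrary and only for illustration):

```yaml
trainer:
    optimizer:
        type: adam
        clip_global_norm: 0.1  # clip the global norm of all gradients to 0.1
```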
Configuring training length¶
The length of the training process is configured by:
- `epochs` (default: 100): One epoch is one pass through the entire dataset. By default, `epochs` is 100, which means that the training process will run for a maximum of 100 epochs before terminating.
- `train_steps` (default: `None`): The maximum number of steps to train for, using one mini-batch per step. By default this is unset, and `epochs` will be used to determine training length.
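For example, to bound training by steps instead of epochs, a minimal sketch (the step count is illustrative):

```yaml
trainer:
    # stop after at most 10000 mini-batch steps, regardless of how many epochs that is
    train_steps: 10000
```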
Tip
In general, it's a good idea to set up a long training runway, relying on
early stopping criteria (`early_stop`) to stop training when there
hasn't been any improvement for a long time.
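A sketch of that pattern, with illustrative values:

```yaml
trainer:
    epochs: 1000    # long runway; early stopping usually triggers first
    early_stop: 10  # stop after 10 consecutive evaluations without improvement
```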
Configuring checkpoint and evaluation frequency¶
Evaluation is run every time the model is checkpointed.
By default, checkpoint-evaluation will occur once every epoch.
The frequency of checkpoint-evaluation can be configured using:
- `steps_per_checkpoint` (default: 0): every `n` training steps
- `checkpoints_per_epoch` (default: 0): `n` times per epoch
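For example, to checkpoint and evaluate twice per epoch, a minimal sketch (remember that only one of the two parameters may be non-zero):

```yaml
trainer:
    checkpoints_per_epoch: 2  # checkpoint and evaluate every half epoch
    steps_per_checkpoint: 0   # must stay 0 when checkpoints_per_epoch is set
```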
Note
It is invalid to specify both non-zero `steps_per_checkpoint` and non-zero `checkpoints_per_epoch`.
Tip
Running evaluation once per epoch is an appropriate fit for small datasets that fit in memory and train quickly. However, this can be a poor fit for unstructured datasets, which tend to be much larger, and train more slowly due to larger models.
Running evaluation too frequently can be wasteful, while running it too infrequently can be uninformative. In large-scale training runs, it's common for evaluation to be configured to run on a sub-epoch time scale, or every few thousand steps.
We recommend configuring evaluation such that new evaluation results are available at least several times an hour. In general, it is not necessary for models to train over the entirety of a dataset, nor evaluate over the entirety of a test set, to produce useful monitoring metrics and signals to indicate model performance.
Increasing throughput¶
Skip evaluation on the training set¶
Consider setting `evaluate_training_set=False`, which skips evaluation on the training set.
Note
Sometimes it can be useful to monitor evaluation metrics on the training set, as a secondary validation set. However, running evaluation on the full training set, when your training set is large, can be a huge computational cost. Turning off training set evaluation will lead to significant gains in training throughput and efficiency.
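In the `trainer` section, this would look like:

```yaml
trainer:
    # skip the (potentially expensive) evaluation pass over the training set
    evaluate_training_set: false
```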
Increase batch size¶
Users training on GPUs can often increase training throughput by increasing
the `batch_size` so that more examples are computed every training step. Set
`batch_size` to `auto` to use the largest batch size that can fit in memory.
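A sketch of the `auto` setting:

```yaml
trainer:
    batch_size: auto  # use the largest batch size that fits in memory
```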