Trainer
Overview¶
The trainer section of the configuration lets you specify parameters that configure the training process, like the number of epochs or the learning rate.
By default, the ECD trainer is used.
trainer:
    type: trainer
    epochs: 100
    train_steps: None
    early_stop: 5
    batch_size: 128
    eval_batch_size: null
    evaluate_training_set: True
    checkpoints_per_epoch: 0
    steps_per_checkpoint: 0
    regularization_lambda: 0
    regularization_type: l2
    increase_batch_size_on_plateau: 0
    increase_batch_size_on_plateau_patience: 5
    increase_batch_size_on_plateau_rate: 2
    increase_batch_size_on_plateau_max: 512
    learning_rate: 0.001
    learning_rate_scheduler:
        decay: exponential
        decay_steps: 10000
        decay_rate: 0.96
        staircase: false
        reduce_on_plateau: 0
        reduce_on_plateau_patience: 5
        reduce_on_plateau_rate: 0.5
    validation_field: combined
    validation_metric: loss
    bucketing_field: null
    learning_rate_warmup_epochs: 1
    optimizer:
        type: adam
        beta1: 0.9
        beta2: 0.999
        epsilon: 1e-08
        clip_global_norm: 0.5
        clipnorm: null
        clip_value: null
trainer:
    type: lightgbm_trainer
    boosting_type: gbdt
    num_boost_round: 100
    learning_rate: 0.001
    max_cat_to_onehot: 4
    max_delta_step: 0.0
    lambda_l1: 0.0
    linear_lambda: 0.0
    cat_l2: 10.0
    neg_bagging_fraction: 1.0
    skip_drop: 0.5
    tree_learner: serial
    extra_trees: False
    lambda_l2: 0.0
    min_data_per_group: 100
    min_gain_to_split: 0.0
    validation_metric: loss
    max_cat_threshold: 32
    max_bin: 255
    early_stop: 5
    cegb_penalty_split: 0.0
    cegb_tradeoff: 1.0
    other_rate: 0.1
    path_smooth: 0.0
    evaluate_training_set: True
    num_leaves: 31
    cat_smooth: 10.0
    extra_seed: 6
    bagging_seed: 3
    min_sum_hessian_in_leaf: 0.001
    min_data_in_leaf: 20
    top_rate: 0.2
    feature_fraction_seed: 2
    drop_rate: 0.1
    xgboost_dart_mode: False
    drop_seed: 4
    max_depth: -1
    feature_fraction_bynode: 1.0
    bagging_freq: 0
    pos_bagging_fraction: 1.0
    feature_fraction: 1.0
    eval_batch_size: 128
    bagging_fraction: 1.0
    uniform_drop: False
    validation_field: combined
    max_drop: 50
    verbose: 0
Trainer parameters¶
type (default: trainer): Trainer to use for training the model. Must be one of ['trainer', 'ray_legacy_trainer']; corresponds to a name in ludwig.trainers.registry.(ray_)trainers_registry.
epochs (default: 100): Number of epochs the training process will run for.
train_steps (default: None): Maximum number of training steps the training process will run for. If unset, epochs is used to determine training length.
early_stop (default: 5): Number of consecutive rounds of evaluation without any improvement on the validation_metric that triggers training to stop. Can be set to -1, which disables early stopping entirely.
batch_size (default: 128): Size of the batch used for training the model.
eval_batch_size (default: null): Size of the batch used for evaluating the model. If it is 0 or null, the value of batch_size is used. Setting a much bigger batch size than the one used for training can speed up evaluation if enough memory is available.
evaluate_training_set (default: True): Whether to include the entire training set during evaluation.
checkpoints_per_epoch (default: 0): Number of checkpoints per epoch. For example, 2 means checkpoints are written every half of an epoch. It is invalid to specify both a non-zero steps_per_checkpoint and a non-zero checkpoints_per_epoch.
steps_per_checkpoint (default: 0): How often the model is checkpointed, in training steps. Also dictates maximum evaluation frequency. If 0, the model is checkpointed after every epoch.
regularization_lambda (default: 0): The lambda parameter used for adding regularization loss to the overall loss.
regularization_type (default: l2): The type of regularization.
learning_rate (default: 0.001): The learning rate to use.
learning_rate_scheduler section:
    reduce_on_plateau (default: 0): If there is a validation set, how many times to reduce the learning rate when a plateau of the validation measure is reached.
    reduce_on_plateau_patience (default: 10): If there is a validation set, number of epochs of patience without an improvement on the validation measure before reducing the learning rate.
    reduce_on_plateau_rate (default: 0.1): If there is a validation set, the reduction rate of the learning rate.
    decay (default: null): One of null, linear, exponential. Selects the learning rate decay strategy. Specifying null deactivates learning rate decay.
    decay_rate (default: 0.96): The rate of the exponential learning rate decay.
    decay_steps (default: 10000): The number of steps of the exponential learning rate decay.
    staircase (default: false): Decays the learning rate at discrete intervals.
    warmup_evaluations (default: 0): The number of training epochs where learning rate warmup will be used. It is calculated as described in Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. In the paper the authors suggest 6 epochs of warmup; that value is suggested for large datasets and big batches.
    warmup_fraction (default: 0): Fraction of total training steps to warm up the learning rate for.
    reduce_eval_metric (default: loss): Metric used to trigger a learning rate reduction when reduce_on_plateau > 0.
    reduce_eval_split (default: training): Which dataset split to compute reduce_eval_metric on for reducing the learning rate when reduce_on_plateau > 0.
increase_batch_size_on_plateau (default: 0): If there is a validation set, how many times to increase the batch size when a plateau of the validation measure is reached.
increase_batch_size_on_plateau_patience (default: 5): If there is a validation set, number of epochs of patience without an improvement on the validation measure before increasing the batch size.
increase_batch_size_on_plateau_rate (default: 2): If there is a validation set, the increase rate of the batch size.
increase_batch_size_on_plateau_max (default: 512): If there is a validation set, the maximum value of the batch size.
validation_field (default: combined): When there is more than one output feature, which one to use for computing whether there was an improvement on validation. The measure used to determine whether there was an improvement can be set with the validation_metric parameter. Different data types have different metrics; refer to the datatype-specific section for more details. combined indicates the use of the combination of all features. For instance, the combination of combined and loss as the measure uses a decrease in the combined loss of all output features to check for improvement on validation, while combined and accuracy considers on how many examples the predictions for all output features were correct (but consider that for some features, for instance numeric, there is no accuracy measure, so you should use accuracy only if all your output features have an accuracy measure).
validation_metric (default: loss): The metric used to determine whether there was an improvement. The metric is considered for the output feature specified in validation_field. Different data types have different available metrics; refer to the datatype-specific section for more details.
bucketing_field (default: null): When not null, batches are not created by shuffling randomly. Instead, the length along the last dimension of the matrix of the specified input feature is used for bucketing examples, and randomly shuffled examples from the same bin are then sampled. Padding is trimmed to the longest example in the batch. The specified feature should be either a sequence or text feature, and the encoder encoding it has to be rnn. When used, bucketing improves the speed of rnn encoding by up to 1.5x, depending on the length distribution of the inputs.
optimizer (default: {type: adam, beta1: 0.9, beta2: 0.999, epsilon: 1e-08}): Which optimizer to use, with its parameters. The available optimizers are: sgd (or stochastic_gradient_descent, gd, gradient_descent; they are all the same), adam, adadelta, adagrad, adamax, ftrl, nadam, rmsprop. Check the PyTorch optimizer documentation for a full list of parameters for each optimizer. The optimizer definition can also specify gradient clipping using clip_global_norm, clipnorm, and clip_value.
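The exponential decay options of the learning_rate_scheduler section can be read as the schedule sketched below. This is a minimal illustration that assumes the common exponential-decay formula (the base rate multiplied by decay_rate raised to step / decay_steps, with the exponent rounded down when staircase is true). The function name decayed_learning_rate is hypothetical, and Ludwig's internal implementation may differ in detail.

```python
import math


def decayed_learning_rate(base_lr, step, decay_rate=0.96, decay_steps=10000, staircase=False):
    """Illustrative exponential decay schedule (an assumption, not Ludwig's code)."""
    exponent = step / decay_steps
    if staircase:
        # Discrete intervals: the exponent only changes once every decay_steps steps.
        exponent = math.floor(exponent)
    return base_lr * decay_rate ** exponent
```

With the defaults shown above, the rate stays at 0.001 at step 0 and is multiplied by 0.96 for every 10000 steps; with staircase enabled it stays constant between those boundaries instead of shrinking continuously.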
See the LightGBM documentation for more details about the available parameters.
type (default: lightgbm_trainer): Trainer to use for training the model. Must be one of ['lightgbm_trainer']; corresponds to a name in ludwig.trainers.registry.(ray_)trainers_registry.
boosting_type (default: gbdt): Type of boosting algorithm to use. Options: gbdt (traditional Gradient Boosting Decision Tree), rf (random forest), dart, goss.
num_boost_round (default: 100): Number of boosting rounds to perform.
learning_rate (default: 0.001): Boosting learning rate.
max_cat_to_onehot (default: 4): Maximum categorical cardinality required before one-hot encoding.
max_delta_step (default: 0.0): Used to limit the max output of tree leaves. A negative value means no constraint.
lambda_l1 (default: 0.0): L1 regularization factor.
linear_lambda (default: 0.0): Linear tree regularization.
cat_l2 (default: 10.0): L2 regularization factor for categorical split.
neg_bagging_fraction (default: 1.0): Fraction of negative data to use for bagging.
skip_drop (default: 0.5): Probability of skipping the dropout during one boosting iteration. Used only with boosting_type 'dart'.
tree_learner (default: serial): Type of tree learner to use.
extra_trees (default: False): Whether to use extremely randomized trees.
lambda_l2 (default: 0.0): L2 regularization factor.
min_data_per_group (default: 100): Minimum number of data points per categorical group.
min_gain_to_split (default: 0.0): Minimum gain to split a leaf.
validation_metric (default: loss): Metric used on validation_field to determine whether there was an improvement.
max_cat_threshold (default: 32): Number of split points considered for categorical features.
max_bin (default: 255): Maximum number of bins to use for discretizing features.
early_stop (default: 5): Number of consecutive rounds of evaluation without any improvement on the validation_metric that triggers training to stop. Can be set to -1, which disables early stopping entirely.
cegb_penalty_split (default: 0.0): Cost-effective gradient boosting penalty for splitting a node.
cegb_tradeoff (default: 1.0): Cost-effective gradient boosting multiplier for all penalties.
other_rate (default: 0.1): The retain ratio of small gradient data. Used only with boosting_type 'goss'.
path_smooth (default: 0.0): Smoothing factor applied to tree nodes.
evaluate_training_set (default: True): Whether to include the entire training set during evaluation.
num_leaves (default: 31): Number of leaves to use in the tree.
cat_smooth (default: 10.0): Smoothing factor for categorical split.
extra_seed (default: 6): Random seed for extremely randomized trees.
bagging_seed (default: 3): Random seed for bagging.
min_sum_hessian_in_leaf (default: 0.001): Minimum sum of hessians in a leaf.
min_data_in_leaf (default: 20): Minimum number of data points in a leaf.
top_rate (default: 0.2): The retain ratio of large gradient data. Used only with boosting_type 'goss'.
feature_fraction_seed (default: 2): Random seed for feature fraction.
drop_rate (default: 0.1): Dropout rate. Used only with boosting_type 'dart'.
xgboost_dart_mode (default: False): Whether to use xgboost dart mode. Used only with boosting_type 'dart'.
drop_seed (default: 4): Random seed to choose dropping models. Used only with boosting_type 'dart'.
max_depth (default: -1): Maximum depth of a tree. A negative value means no limit.
feature_fraction_bynode (default: 1.0): Fraction of features to use for each tree node.
bagging_freq (default: 0): Frequency of bagging.
pos_bagging_fraction (default: 1.0): Fraction of positive data to use for bagging.
feature_fraction (default: 1.0): Fraction of features to use.
eval_batch_size (default: 128): Size of batch to pass to the model for evaluation.
bagging_fraction (default: 1.0): Fraction of data to use for bagging.
uniform_drop (default: False): Whether to use uniform dropout. Used only with boosting_type 'dart'.
validation_field (default: combined): First output feature; by default it is set to the same field as the first output feature.
max_drop (default: 50): Maximum number of dropped trees during one boosting iteration. Used only with boosting_type 'dart'. A negative value means no limit.
verbose (default: 0): Verbosity level for GBM trainer.
Optimizer parameters¶
The available optimizers wrap the ones available in PyTorch. For details about the parameters that can be used to configure different optimizers, please refer to the PyTorch documentation.
The learning_rate parameter used by the optimizer comes from the trainer section.
Other optimizer specific parameters, shown with their Ludwig default settings, follow:
sgd (or stochastic_gradient_descent, gd, gradient_descent)
    momentum: 0.0
    nesterov: false
adam
    beta_1: 0.9
    beta_2: 0.999
    epsilon: 1e-08
adadelta
    rho: 0.95
    epsilon: 1e-08
adagrad
    initial_accumulator_value: 0.1
    epsilon: 1e-07
adamax
    beta_1: 0.9
    beta_2: 0.999
    epsilon: 1e-07
ftrl
    learning_rate_power: -0.5
    initial_accumulator_value: 0.1
    l1_regularization_strength: 0.0
    l2_regularization_strength: 0.0
nadam
    beta_1: 0.9
    beta_2: 0.999
    epsilon: 1e-07
rmsprop
    decay: 0.9
    momentum: 0.0
    epsilon: 1e-10
    centered: false
Note
Gradient clipping is also configurable, through optimizers, with the following parameters:
clip_global_norm: 0.5
clipnorm: null
clip_value: null
No optimizer parameters are available for the LightGBM trainer.
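As a rough illustration of what a setting such as clip_global_norm: 0.5 is meant to achieve, the sketch below rescales a set of gradients so that their combined L2 norm does not exceed the threshold. It uses plain Python lists and a hypothetical helper name, clip_by_global_norm; it reflects the usual definition of global-norm clipping and is not taken from Ludwig's or PyTorch's implementation, whose numerical details may differ.

```python
import math


def clip_by_global_norm(gradients, clip_global_norm=0.5):
    """Scale all gradients by one common factor so their joint L2 norm is at most the threshold.

    `gradients` is a list of per-parameter gradient lists (a stand-in for tensors).
    """
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if clip_global_norm is None or global_norm <= clip_global_norm:
        # Clipping disabled, or gradients already small enough: return unchanged copies.
        return [list(grad) for grad in gradients]
    scale = clip_global_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients]
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved and only its magnitude is limited.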
Training length¶
The length of the training process is configured by:
epochs (default: 100): One epoch is one pass through the entire dataset. By default, epochs is 100, which means that the training process will run for a maximum of 100 epochs before terminating.
train_steps (default: None): The maximum number of steps to train for, using one mini-batch per step. By default this is unset, and epochs will be used to determine training length.
num_boost_round (default: 100): The number of boosting iterations. By default, num_boost_round is 100, which means that the training process will run for a maximum of 100 boosting iterations before terminating.
Tip
In general, it's a good idea to set up a long training runway, relying on early stopping criteria (early_stop) to stop training when there hasn't been any improvement for a long time.
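The relationship between epochs, train_steps and batch_size can be made concrete with a small calculation. The sketch below is illustrative only: the helper name total_training_steps is hypothetical, and counting a final partial batch as a full step is an assumption rather than documented Ludwig behavior.

```python
import math


def total_training_steps(num_examples, batch_size=128, epochs=100, train_steps=None):
    """Upper bound on training steps, before early stopping is taken into account."""
    if train_steps is not None:
        # An explicit step budget takes precedence over epochs.
        return train_steps
    # Assumption: a final, smaller batch still counts as one step.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return epochs * steps_per_epoch
```

For example, under these assumptions a dataset of 1000 examples with the default batch size of 128 gives 8 steps per epoch, so the default of 100 epochs corresponds to at most 800 steps.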
Early stopping¶
Machine learning models, when trained for too long, are often prone to overfitting. It's generally a good policy to set up some early stopping criteria: there is little value in continuing to train a model after it has learned as much as it can, and stopping at that point helps it retain its ability to generalize to new data.
How early stopping works in Ludwig¶
By default, Ludwig sets trainer.early_stop=5, which means that if there have been 5 consecutive rounds of evaluation where there hasn't been any improvement on the validation subset, then training will terminate.
Ludwig runs evaluation once per checkpoint, which by default is once per epoch. Checkpoint frequency can be configured using checkpoints_per_epoch (default: 0) or steps_per_checkpoint (default: 0, disabled). See the Checkpoint-evaluation frequency section for more details.
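The patience rule described above can be sketched as a small function. This is an illustration of the logic only, assuming a metric where lower is better (such as loss); the name should_stop_early is hypothetical and this is not Ludwig's implementation.

```python
def should_stop_early(validation_losses, early_stop=5):
    """Return True once `early_stop` consecutive evaluations fail to improve on the best loss."""
    if early_stop < 0:
        # A value of -1 disables early stopping entirely.
        return False
    best = float("inf")
    rounds_without_improvement = 0
    for loss in validation_losses:
        if loss < best:
            best = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= early_stop:
                return True
    return False
```

Each element of validation_losses stands for one round of evaluation, so with the default checkpoint settings the list would hold one value per epoch.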
Changing the early stopping metric¶
The metric that dictates early stopping is set by trainer.validation_field and trainer.validation_metric. By default, early stopping uses the combined loss on the validation subset.
trainer:
    validation_field: combined
    validation_metric: loss
However, this can be configured to use other metrics. For example, if we had an output feature called recommended, then we can configure early stopping on the output feature accuracy like so:
trainer:
    validation_field: recommended
    validation_metric: accuracy
Disabling early stopping¶
trainer.early_stop can be set to -1, which disables early stopping entirely.
Checkpoint-evaluation frequency¶
Evaluation is run every time the model is checkpointed.
By default, checkpoint-evaluation will occur once every epoch.
The frequency of checkpoint-evaluation can be configured using:
steps_per_checkpoint (default: 0): every n training steps
checkpoints_per_epoch (default: 0): n times per epoch
Note
It is invalid to specify both non-zero steps_per_checkpoint and non-zero checkpoints_per_epoch.
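The interaction between the two settings can be summarized in a short sketch. The helper name steps_between_checkpoints is hypothetical, and the rounding used for checkpoints_per_epoch is an assumption; only the default of once per epoch and the rule that both settings cannot be non-zero come from the text above.

```python
import math


def steps_between_checkpoints(steps_per_epoch, steps_per_checkpoint=0, checkpoints_per_epoch=0):
    """Number of training steps between consecutive checkpoint-evaluations."""
    if steps_per_checkpoint != 0 and checkpoints_per_epoch != 0:
        raise ValueError("Specify only one of steps_per_checkpoint and checkpoints_per_epoch.")
    if steps_per_checkpoint > 0:
        return steps_per_checkpoint
    if checkpoints_per_epoch > 0:
        # Assumption: round up, and never go below one step between checkpoints.
        return max(1, math.ceil(steps_per_epoch / checkpoints_per_epoch))
    # Both are 0: checkpoint (and evaluate) once per epoch.
    return steps_per_epoch
```

For an epoch of 100 steps, checkpoints_per_epoch: 2 yields a checkpoint every 50 steps, while steps_per_checkpoint: 30 yields one every 30 steps regardless of epoch length.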
Tip
Running evaluation once per epoch is an appropriate fit for small datasets that fit in memory and train quickly. However, this can be a poor fit for unstructured datasets, which tend to be much larger, and train more slowly due to larger models.
Running evaluation too frequently can be wasteful while running evaluation not frequently enough can be uninformative. In large-scale training runs, it's common for evaluation to be configured to run on a sub-epoch time scale, or every few thousand steps.
We recommend configuring evaluation such that new evaluation results are available at least several times an hour. In general, it is not necessary for models to train over the entirety of a dataset, nor evaluate over the entirety of a test set, to produce useful monitoring metrics and signals to indicate model performance.
Increasing throughput¶
Skip evaluation on the training set¶
Consider setting evaluate_training_set=False, which skips evaluation on the training set.
Note
Sometimes it can be useful to monitor evaluation metrics on the training set, as a secondary validation set. However, running evaluation on the full training set, when your training set is large, can be a huge computational cost. Turning off training set evaluation will lead to significant gains in training throughput and efficiency.
Increase batch size¶
Users training on GPUs can often increase training throughput by increasing the batch_size so that more examples are computed every training step. Set batch_size to auto to use the largest batch size that can fit in memory.
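Batch size can also grow during training through the increase_batch_size_on_plateau parameters listed under Trainer parameters. The sketch below shows one plausible reading of how those parameters combine once a plateau has been detected; the function name next_batch_size and the integer rounding are assumptions, not Ludwig's implementation.

```python
def next_batch_size(
    batch_size,
    increases_so_far,
    increase_batch_size_on_plateau=0,
    increase_batch_size_on_plateau_rate=2,
    increase_batch_size_on_plateau_max=512,
):
    """Batch size to use after a validation plateau has been detected."""
    if increases_so_far >= increase_batch_size_on_plateau:
        # No increases left; the default of 0 means the batch size never grows.
        return batch_size
    grown = int(batch_size * increase_batch_size_on_plateau_rate)
    # Never exceed the configured maximum.
    return min(grown, increase_batch_size_on_plateau_max)
```

With increase_batch_size_on_plateau: 2 and the other defaults, a starting batch size of 128 would become 256 after the first plateau and 512 after the second, then stay there.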