Trainer
Overview
The trainer section of the configuration lets you specify parameters that configure the training process, like the number of epochs or the learning rate. By default, the ECD trainer is used.
ECD trainer (default):
trainer:
    early_stop: 5
    learning_rate: 0.001
    epochs: 100
    batch_size: auto
    optimizer:
        type: adam
        betas:
        - 0.9
        - 0.999
        eps: 1.0e-08
        weight_decay: 0.0
        amsgrad: false
    regularization_type: l2
    use_mixed_precision: false
    compile: false
    checkpoints_per_epoch: 0
    effective_batch_size: auto
    gradient_accumulation_steps: auto
    regularization_lambda: 0.0
    enable_gradient_checkpointing: false
    validation_field: null
    validation_metric: null
    train_steps: null
    steps_per_checkpoint: 0
    max_batch_size: 1099511627776
    eval_batch_size: null
    evaluate_training_set: false
    should_shuffle: true
    increase_batch_size_on_plateau: 0
    increase_batch_size_on_plateau_patience: 5
    increase_batch_size_on_plateau_rate: 2.0
    increase_batch_size_eval_metric: loss
    increase_batch_size_eval_split: training
    gradient_clipping:
        clipglobalnorm: 0.5
        clipnorm: null
        clipvalue: null
    learning_rate_scaling: linear
    bucketing_field: null
    skip_all_evaluation: false
    enable_profiling: false
    profiler:
        wait: 1
        warmup: 1
        active: 3
        repeat: 5
        skip_first: 0
    learning_rate_scheduler:
        decay: null
        decay_rate: 0.96
        decay_steps: 10000
        staircase: false
        reduce_on_plateau: 0
        reduce_on_plateau_patience: 10
        reduce_on_plateau_rate: 0.1
        warmup_evaluations: 0
        warmup_fraction: 0.0
        reduce_eval_metric: loss
        reduce_eval_split: training
        t_0: null
        t_mult: 1
        eta_min: 0
LLM fine-tuning trainer:
trainer:
    type: finetune
    early_stop: 5
    learning_rate: 0.001
    epochs: 100
    batch_size: auto
    optimizer:
        type: adam
        betas:
        - 0.9
        - 0.999
        eps: 1.0e-08
        weight_decay: 0.0
        amsgrad: false
    regularization_type: l2
    use_mixed_precision: false
    compile: false
    checkpoints_per_epoch: 0
    effective_batch_size: auto
    gradient_accumulation_steps: auto
    regularization_lambda: 0.0
    enable_gradient_checkpointing: false
    validation_field: null
    validation_metric: null
    train_steps: null
    steps_per_checkpoint: 0
    max_batch_size: 1099511627776
    evaluate_training_set: false
    should_shuffle: true
    increase_batch_size_on_plateau: 0
    increase_batch_size_on_plateau_patience: 5
    increase_batch_size_on_plateau_rate: 2.0
    increase_batch_size_eval_metric: loss
    increase_batch_size_eval_split: training
    gradient_clipping:
        clipglobalnorm: 0.5
        clipnorm: null
        clipvalue: null
    learning_rate_scaling: linear
    bucketing_field: null
    skip_all_evaluation: false
    enable_profiling: false
    profiler:
        wait: 1
        warmup: 1
        active: 3
        repeat: 5
        skip_first: 0
    learning_rate_scheduler:
        decay: null
        decay_rate: 0.96
        decay_steps: 10000
        staircase: false
        reduce_on_plateau: 0
        reduce_on_plateau_patience: 10
        reduce_on_plateau_rate: 0.1
        warmup_evaluations: 0
        warmup_fraction: 0.0
        reduce_eval_metric: loss
        reduce_eval_split: training
        t_0: null
        t_mult: 1
        eta_min: 0
    eval_batch_size: 2
    base_learning_rate: 0.0
GBM trainer:
trainer:
    early_stop: 5
    learning_rate: 0.03
    max_depth: 18
    boosting_type: gbdt
    bagging_fraction: 0.8
    feature_fraction: 0.75
    extra_trees: false
    lambda_l1: 0.25
    lambda_l2: 0.2
    drop_rate: 0.1
    tree_learner: serial
    boosting_rounds_per_checkpoint: 50
    num_boost_round: 1000
    num_leaves: 82
    min_data_in_leaf: 20
    pos_bagging_fraction: 1.0
    neg_bagging_fraction: 1.0
    bagging_freq: 1
    bagging_seed: 3
    feature_fraction_bynode: 1.0
    feature_fraction_seed: 2
    extra_seed: 6
    linear_lambda: 0.0
    max_drop: 50
    skip_drop: 0.5
    uniform_drop: false
    drop_seed: 4
    validation_field: null
    validation_metric: null
    eval_batch_size: 1048576
    evaluate_training_set: false
    min_sum_hessian_in_leaf: 0.001
    max_delta_step: 0.0
    min_gain_to_split: 0.03
    xgboost_dart_mode: false
    top_rate: 0.2
    other_rate: 0.1
    min_data_per_group: 100
    max_cat_threshold: 32
    cat_l2: 10.0
    cat_smooth: 10.0
    max_cat_to_onehot: 4
    cegb_tradeoff: 1.0
    cegb_penalty_split: 0.0
    path_smooth: 0.0
    verbose: -1
    max_bin: 255
    feature_pre_filter: true
    skip_all_evaluation: false
    enable_profiling: false
    profiler:
        wait: 1
        warmup: 1
        active: 3
        repeat: 5
        skip_first: 0
Trainer parameters
ECD trainer:
early_stop (default: 5): Number of consecutive rounds of evaluation without any improvement on the validation_metric that triggers training to stop. Can be set to -1, which disables early stopping entirely.
learning_rate (default: 0.001): Controls how much to change the model in response to the estimated error each time the model weights are updated. If 'auto', the optimal learning rate is estimated by choosing the learning rate that produces the smallest non-diverging gradient update.
epochs (default: 100): Number of epochs the algorithm is intended to be run over. Overridden if train_steps is set.
batch_size (default: auto): The number of training examples utilized in one training step of the model. If 'auto', the batch size that maximizes training throughput (samples / sec) will be used. For CPU training, the tuned batch size is capped at 128, as the throughput benefits of large batch sizes are less noticeable without a GPU.
optimizer (default: {"type": "adam"}): Optimizer type and its parameters. The optimizer is responsible for applying the gradients computed from the loss during backpropagation as updates to the model weights. See Optimizer parameters for details.
regularization_type (default: l2): Type of regularization. Options: l1, l2, l1_l2, null.
use_mixed_precision (default: false): Enable automatic mixed-precision (AMP) during training. Options: true, false.
compile (default: false): Whether to compile the model before training. Options: true, false.
checkpoints_per_epoch (default: 0): Number of checkpoints per epoch. For example, with a value of 2, checkpoints are written every half of an epoch. Note that it is invalid to specify both non-zero steps_per_checkpoint and non-zero checkpoints_per_epoch.
effective_batch_size (default: auto): The effective batch size is the total number of samples used to compute a single gradient update to the model weights. This differs from batch_size by taking gradient_accumulation_steps and the number of training worker processes into account. In practice, effective_batch_size = batch_size * gradient_accumulation_steps * num_workers. If 'auto', the effective batch size is derived implicitly from batch_size. If set explicitly, then one of batch_size or gradient_accumulation_steps must be set to something other than 'auto', and the other will consequently be set following the formula given above.
gradient_accumulation_steps (default: auto): Number of steps to accumulate gradients over before performing a weight update.
regularization_lambda (default: 0.0): Strength of the regularization.
enable_gradient_checkpointing (default: false): Whether to enable gradient checkpointing, which trades compute for memory. This is useful for training very deep models with limited memory. Options: true, false.
validation_field (default: null): The field for which the validation_metric is used for validation-related mechanics like early stopping, parameter change plateaus, as well as what hyperparameter optimization uses to determine the best trial. If unset (default), the first output feature is used. If explicitly specified, neither validation_field nor validation_metric is overwritten.
validation_metric (default: null): Metric from validation_field that is used. If validation_field is not explicitly specified, this is overwritten to be the first output feature type's default_validation_metric, consistent with validation_field. If the validation_metric is specified, then the first output feature that produces this metric is used as the validation_field.
train_steps (default: null): Maximum number of training steps the algorithm is intended to be run over. Unset by default. If set, it overrides epochs; if left unset, epochs is used to determine training length.
steps_per_checkpoint (default: 0): How often the model is checkpointed. Also dictates maximum evaluation frequency. If 0, the model is checkpointed after every epoch.
max_batch_size (default: 1099511627776): Auto batch size tuning and increasing batch size on plateau will be capped at this value. The default value is 2^40.
eval_batch_size (default: null): Size of batch to pass to the model for evaluation. If it is 0 or None, the same value of batch_size is used. This is useful to speed up evaluation with a much bigger batch size than training, if enough memory is available. If 'auto', the biggest batch size (power of 2) that can fit in memory will be used.
evaluate_training_set (default: false): Whether to evaluate on the entire training set during evaluation. By default, training metrics will be computed at the end of each training step, and accumulated up to the evaluation phase. In practice, computing training set metrics during training is up to 30% faster than running a separate evaluation pass over the training set, but results in noisier training metrics, particularly during the earlier epochs. It's recommended to only set this to true if you need very exact training set metrics and are willing to pay a significant performance penalty for them. Options: true, false.
should_shuffle (default: true): Whether to shuffle batches during training. Options: true, false.
increase_batch_size_on_plateau (default: 0): The number of times to increase the batch size on a plateau.
increase_batch_size_on_plateau_patience (default: 5): How many epochs to wait for before increasing the batch size.
increase_batch_size_on_plateau_rate (default: 2.0): Rate at which the batch size increases.
increase_batch_size_eval_metric (default: loss): Which metric to listen on for increasing the batch size.
increase_batch_size_eval_split (default: training): Which dataset split to listen on for increasing the batch size.
gradient_clipping: Parameter values for gradient clipping.
gradient_clipping.clipglobalnorm (default: 0.5): Maximum allowed norm of the gradients.
gradient_clipping.clipnorm (default: null): Maximum allowed norm of the gradients.
gradient_clipping.clipvalue (default: null): Maximum allowed value of the gradients.
learning_rate_scaling (default: linear): Scale by which to increase the learning rate as the number of distributed workers increases. Traditionally the learning rate is scaled linearly with the number of workers to reflect the proportion by which the effective batch size is increased. For very large batch sizes, a softer square-root scale can sometimes lead to better model performance. If the learning rate is hand-tuned for a given number of workers, setting this value to constant can be used to disable scale-up. Options: constant, sqrt, linear.
bucketing_field (default: null): Feature to use for bucketing datapoints.
skip_all_evaluation (default: false): Whether to skip evaluation entirely. If you are training a model with a well-known configuration on a well-known dataset and are confident about the expected results, you might skip all evaluation. Moreover, evaluating a model, especially on large validation or test sets, can be time-consuming. Options: true, false.
enable_profiling (default: false): Whether to enable profiling of the training process using torch.profiler.profile. Options: true, false.
profiler: Parameter values for profiling config.
profiler.wait (default: 1): The number of steps to wait before profiling.
profiler.warmup (default: 1): The number of steps for profiler warmup after waiting finishes.
profiler.active (default: 3): The number of steps that are actively recorded. Values greater than 10 will dramatically slow down TensorBoard loading.
profiler.repeat (default: 5): The optional number of profiling cycles. Use 0 to profile the entire training run.
profiler.skip_first (default: 0): The number of steps to skip at the beginning of training.
learning_rate_scheduler: Parameter values for learning rate scheduler.
learning_rate_scheduler.decay (default: null): Turn on decay of the learning rate. Options: linear, exponential, cosine, null.
learning_rate_scheduler.decay_rate (default: 0.96): Decay per epoch (%): factor by which to decrease the learning rate.
learning_rate_scheduler.decay_steps (default: 10000): The number of steps to take in the exponential learning rate decay.
learning_rate_scheduler.staircase (default: false): Decays the learning rate at discrete intervals. Options: true, false.
learning_rate_scheduler.reduce_on_plateau (default: 0): How many times to reduce the learning rate when the algorithm hits a plateau (i.e. the performance on the training set does not improve).
learning_rate_scheduler.reduce_on_plateau_patience (default: 10): How many evaluation steps have to pass before the learning rate reduces when reduce_on_plateau > 0.
learning_rate_scheduler.reduce_on_plateau_rate (default: 0.1): Rate at which we reduce the learning rate when reduce_on_plateau > 0.
learning_rate_scheduler.warmup_evaluations (default: 0): Number of evaluation steps to warm up the learning rate for.
learning_rate_scheduler.warmup_fraction (default: 0.0): Fraction of total training steps to warm up the learning rate for.
learning_rate_scheduler.reduce_eval_metric (default: loss): Metric whose plateau triggers a learning rate reduction when reduce_on_plateau > 0.
learning_rate_scheduler.reduce_eval_split (default: training): Which dataset split to listen on for reducing the learning rate when reduce_on_plateau > 0.
learning_rate_scheduler.t_0 (default: null): Number of steps before the first restart for cosine annealing decay. If not specified, it will be set to steps_per_checkpoint.
learning_rate_scheduler.t_mult (default: 1): Period multiplier after each restart for cosine annealing decay. Defaults to 1, i.e., restart every t_0 steps. If set to a larger value, the period between restarts increases by that multiplier. For example, if t_mult is 2, then the periods would be: t_0, 2*t_0, 2^2*t_0, 2^3*t_0, etc.
learning_rate_scheduler.eta_min (default: 0): Minimum learning rate allowed for cosine annealing decay.
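The effective_batch_size formula and the learning_rate_scaling options described above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not Ludwig's implementation: the function names are hypothetical, and the exact scaling factors (multiplying by num_workers for linear, by its square root for sqrt) are an assumption based on the parameter description.

```python
import math


def resolve_effective_batch_size(batch_size, gradient_accumulation_steps, num_workers):
    """effective_batch_size = batch_size * gradient_accumulation_steps * num_workers."""
    return batch_size * gradient_accumulation_steps * num_workers


def resolve_gradient_accumulation_steps(effective_batch_size, batch_size, num_workers):
    """Solve the same formula for gradient_accumulation_steps (hypothetical helper)."""
    samples_per_step = batch_size * num_workers
    if effective_batch_size % samples_per_step != 0:
        raise ValueError("effective_batch_size must be divisible by batch_size * num_workers")
    return effective_batch_size // samples_per_step


def scale_learning_rate(learning_rate, num_workers, scaling="linear"):
    """Scale the learning rate with the number of workers (assumed factors)."""
    if scaling == "constant":
        return learning_rate  # hand-tuned rate, no scale-up
    if scaling == "sqrt":
        return learning_rate * math.sqrt(num_workers)
    if scaling == "linear":
        return learning_rate * num_workers
    raise ValueError(f"unknown learning_rate_scaling: {scaling}")


if __name__ == "__main__":
    print(resolve_effective_batch_size(32, 4, 2))
    print(resolve_gradient_accumulation_steps(256, 32, 2))
    print(scale_learning_rate(0.001, 4, "sqrt"))
```

For example, with batch_size 32 on 2 workers, an explicit effective_batch_size of 256 implies 4 gradient accumulation steps.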
LLM fine-tuning trainer:
type (default: finetune): Options: finetune.
early_stop (default: 5): Number of consecutive rounds of evaluation without any improvement on the validation_metric that triggers training to stop. Can be set to -1, which disables early stopping entirely.
learning_rate (default: 0.001): Controls how much to change the model in response to the estimated error each time the model weights are updated. If 'auto', the optimal learning rate is estimated by choosing the learning rate that produces the smallest non-diverging gradient update.
epochs (default: 100): Number of epochs the algorithm is intended to be run over. Overridden if train_steps is set.
batch_size (default: auto): The number of training examples utilized in one training step of the model. If 'auto', the batch size that maximizes training throughput (samples / sec) will be used. For CPU training, the tuned batch size is capped at 128, as the throughput benefits of large batch sizes are less noticeable without a GPU.
optimizer (default: {"type": "adam"}): Optimizer type and its parameters. The optimizer is responsible for applying the gradients computed from the loss during backpropagation as updates to the model weights. See Optimizer parameters for details.
regularization_type (default: l2): Type of regularization. Options: l1, l2, l1_l2, null.
use_mixed_precision (default: false): Enable automatic mixed-precision (AMP) during training. Options: true, false.
compile (default: false): Whether to compile the model before training. Options: true, false.
checkpoints_per_epoch (default: 0): Number of checkpoints per epoch. For example, with a value of 2, checkpoints are written every half of an epoch. Note that it is invalid to specify both non-zero steps_per_checkpoint and non-zero checkpoints_per_epoch.
effective_batch_size (default: auto): The effective batch size is the total number of samples used to compute a single gradient update to the model weights. This differs from batch_size by taking gradient_accumulation_steps and the number of training worker processes into account. In practice, effective_batch_size = batch_size * gradient_accumulation_steps * num_workers. If 'auto', the effective batch size is derived implicitly from batch_size. If set explicitly, then one of batch_size or gradient_accumulation_steps must be set to something other than 'auto', and the other will consequently be set following the formula given above.
gradient_accumulation_steps (default: auto): Number of steps to accumulate gradients over before performing a weight update.
regularization_lambda (default: 0.0): Strength of the regularization.
enable_gradient_checkpointing (default: false): Whether to enable gradient checkpointing, which trades compute for memory. This is useful for training very deep models with limited memory. Options: true, false.
validation_field (default: null): The field for which the validation_metric is used for validation-related mechanics like early stopping, parameter change plateaus, as well as what hyperparameter optimization uses to determine the best trial. If unset (default), the first output feature is used. If explicitly specified, neither validation_field nor validation_metric is overwritten.
validation_metric (default: null): Metric from validation_field that is used. If validation_field is not explicitly specified, this is overwritten to be the first output feature type's default_validation_metric, consistent with validation_field. If the validation_metric is specified, then the first output feature that produces this metric is used as the validation_field.
train_steps (default: null): Maximum number of training steps the algorithm is intended to be run over. Unset by default. If set, it overrides epochs; if left unset, epochs is used to determine training length.
steps_per_checkpoint (default: 0): How often the model is checkpointed. Also dictates maximum evaluation frequency. If 0, the model is checkpointed after every epoch.
max_batch_size (default: 1099511627776): Auto batch size tuning and increasing batch size on plateau will be capped at this value. The default value is 2^40.
evaluate_training_set (default: false): Whether to evaluate on the entire training set during evaluation. By default, training metrics will be computed at the end of each training step, and accumulated up to the evaluation phase. In practice, computing training set metrics during training is up to 30% faster than running a separate evaluation pass over the training set, but results in noisier training metrics, particularly during the earlier epochs. It's recommended to only set this to true if you need very exact training set metrics and are willing to pay a significant performance penalty for them. Options: true, false.
should_shuffle (default: true): Whether to shuffle batches during training. Options: true, false.
increase_batch_size_on_plateau (default: 0): The number of times to increase the batch size on a plateau.
increase_batch_size_on_plateau_patience (default: 5): How many epochs to wait for before increasing the batch size.
increase_batch_size_on_plateau_rate (default: 2.0): Rate at which the batch size increases.
increase_batch_size_eval_metric (default: loss): Which metric to listen on for increasing the batch size.
increase_batch_size_eval_split (default: training): Which dataset split to listen on for increasing the batch size.
gradient_clipping: Parameter values for gradient clipping.
gradient_clipping.clipglobalnorm (default: 0.5): Maximum allowed norm of the gradients.
gradient_clipping.clipnorm (default: null): Maximum allowed norm of the gradients.
gradient_clipping.clipvalue (default: null): Maximum allowed value of the gradients.
learning_rate_scaling (default: linear): Scale by which to increase the learning rate as the number of distributed workers increases. Traditionally the learning rate is scaled linearly with the number of workers to reflect the proportion by which the effective batch size is increased. For very large batch sizes, a softer square-root scale can sometimes lead to better model performance. If the learning rate is hand-tuned for a given number of workers, setting this value to constant can be used to disable scale-up. Options: constant, sqrt, linear.
bucketing_field (default: null): Feature to use for bucketing datapoints.
skip_all_evaluation (default: false): Whether to skip evaluation entirely. If you are training a model with a well-known configuration on a well-known dataset and are confident about the expected results, you might skip all evaluation. Moreover, evaluating a model, especially on large validation or test sets, can be time-consuming. Options: true, false.
enable_profiling (default: false): Whether to enable profiling of the training process using torch.profiler.profile. Options: true, false.
profiler: Parameter values for profiling config.
profiler.wait (default: 1): The number of steps to wait before profiling.
profiler.warmup (default: 1): The number of steps for profiler warmup after waiting finishes.
profiler.active (default: 3): The number of steps that are actively recorded. Values greater than 10 will dramatically slow down TensorBoard loading.
profiler.repeat (default: 5): The optional number of profiling cycles. Use 0 to profile the entire training run.
profiler.skip_first (default: 0): The number of steps to skip at the beginning of training.
learning_rate_scheduler: Parameter values for learning rate scheduler.
learning_rate_scheduler.decay (default: null): Turn on decay of the learning rate. Options: linear, exponential, cosine, null.
learning_rate_scheduler.decay_rate (default: 0.96): Decay per epoch (%): factor by which to decrease the learning rate.
learning_rate_scheduler.decay_steps (default: 10000): The number of steps to take in the exponential learning rate decay.
learning_rate_scheduler.staircase (default: false): Decays the learning rate at discrete intervals. Options: true, false.
learning_rate_scheduler.reduce_on_plateau (default: 0): How many times to reduce the learning rate when the algorithm hits a plateau (i.e. the performance on the training set does not improve).
learning_rate_scheduler.reduce_on_plateau_patience (default: 10): How many evaluation steps have to pass before the learning rate reduces when reduce_on_plateau > 0.
learning_rate_scheduler.reduce_on_plateau_rate (default: 0.1): Rate at which we reduce the learning rate when reduce_on_plateau > 0.
learning_rate_scheduler.warmup_evaluations (default: 0): Number of evaluation steps to warm up the learning rate for.
learning_rate_scheduler.warmup_fraction (default: 0.0): Fraction of total training steps to warm up the learning rate for.
learning_rate_scheduler.reduce_eval_metric (default: loss): Metric whose plateau triggers a learning rate reduction when reduce_on_plateau > 0.
learning_rate_scheduler.reduce_eval_split (default: training): Which dataset split to listen on for reducing the learning rate when reduce_on_plateau > 0.
learning_rate_scheduler.t_0 (default: null): Number of steps before the first restart for cosine annealing decay. If not specified, it will be set to steps_per_checkpoint.
learning_rate_scheduler.t_mult (default: 1): Period multiplier after each restart for cosine annealing decay. Defaults to 1, i.e., restart every t_0 steps. If set to a larger value, the period between restarts increases by that multiplier. For example, if t_mult is 2, then the periods would be: t_0, 2*t_0, 2^2*t_0, 2^3*t_0, etc.
learning_rate_scheduler.eta_min (default: 0): Minimum learning rate allowed for cosine annealing decay.
eval_batch_size (default: 2): Batch size used for evaluation in the LLM trainer.
base_learning_rate (default: 0.0): Base learning rate used for training in the LLM trainer.
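The interaction of t_0, t_mult, and eta_min in cosine annealing decay is easier to see with numbers. The sketch below assumes the standard cosine-annealing-with-warm-restarts formula (the one PyTorch documents for CosineAnnealingWarmRestarts); the helper names are hypothetical and this is not a copy of Ludwig's scheduler code.

```python
import math


def restart_periods(t_0, t_mult=1, num_cycles=4):
    """Length in steps of each cycle: t_0, t_0*t_mult, t_0*t_mult**2, ..."""
    return [t_0 * t_mult**i for i in range(num_cycles)]


def cosine_lr(step, base_lr, t_0, t_mult=1, eta_min=0.0):
    """Learning rate at `step` under cosine annealing with warm restarts."""
    if t_0 <= 0 or t_mult < 1:
        raise ValueError("t_0 must be positive and t_mult must be >= 1")
    period, t_cur = t_0, step
    # Walk forward through completed cycles to find the position in the current one.
    while t_cur >= period:
        t_cur -= period
        period *= t_mult
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / period)) / 2


if __name__ == "__main__":
    print(restart_periods(100, t_mult=2))
    for step in (0, 50, 99, 100, 200):
        print(step, cosine_lr(step, base_lr=0.1, t_0=100, t_mult=2))
```

With t_0 = 100 and t_mult = 2, the learning rate returns to its base value at steps 100, 300, 700, and so on.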
GBM trainer:
See the LightGBM documentation for more details about the available parameters.
early_stop (default: 5): Number of consecutive rounds of evaluation without any improvement on the validation_metric that triggers training to stop. Can be set to -1, which disables early stopping entirely.
learning_rate (default: 0.03): Controls how much to change the model in response to the estimated error each time the model weights are updated.
max_depth (default: 18): Maximum depth of a tree in the GBM trainer. A negative value means no limit.
boosting_type (default: gbdt): Type of boosting algorithm to use with GBM trainer. Options: gbdt, dart.
bagging_fraction (default: 0.8): Fraction of data to use for bagging with GBM trainer.
feature_fraction (default: 0.75): Fraction of features to use in the GBM trainer.
extra_trees (default: false): Whether to use extremely randomized trees in the GBM trainer. Options: true, false.
lambda_l1 (default: 0.25): L1 regularization factor for the GBM trainer.
lambda_l2 (default: 0.2): L2 regularization factor for the GBM trainer.
drop_rate (default: 0.1): Dropout rate for the GBM trainer. Used only with boosting_type 'dart'.
tree_learner (default: serial): Type of tree learner to use with GBM trainer. Options: serial, feature, data, voting.
boosting_rounds_per_checkpoint (default: 50): Number of boosting rounds per checkpoint / evaluation round.
num_boost_round (default: 1000): Number of boosting rounds to perform with GBM trainer.
num_leaves (default: 82): Number of leaves to use in the tree with GBM trainer.
min_data_in_leaf (default: 20): Minimum number of data points in a leaf with GBM trainer.
pos_bagging_fraction (default: 1.0): Fraction of positive data to use for bagging with GBM trainer.
neg_bagging_fraction (default: 1.0): Fraction of negative data to use for bagging with GBM trainer.
bagging_freq (default: 1): Frequency of bagging with GBM trainer.
bagging_seed (default: 3): Random seed for bagging with GBM trainer.
feature_fraction_bynode (default: 1.0): Fraction of features to use for each tree node with GBM trainer.
feature_fraction_seed (default: 2): Random seed for feature fraction with GBM trainer.
extra_seed (default: 6): Random seed for extremely randomized trees in the GBM trainer.
linear_lambda (default: 0.0): Linear tree regularization in the GBM trainer.
max_drop (default: 50): Maximum number of dropped trees during one boosting iteration. Used only with boosting_type 'dart'. A negative value means no limit.
skip_drop (default: 0.5): Probability of skipping the dropout during one boosting iteration. Used only with boosting_type 'dart'.
uniform_drop (default: false): Whether to use uniform dropout in the GBM trainer. Used only with boosting_type 'dart'. Options: true, false.
drop_seed (default: 4): Random seed to choose dropping models in the GBM trainer. Used only with boosting_type 'dart'.
validation_field (default: null): The output feature used for validation. By default, it is set to the first output feature.
validation_metric (default: null): Metric used on validation_field, set by default to the output feature type's default_validation_metric.
eval_batch_size (default: 1048576): Size of batch to pass to the model for evaluation.
evaluate_training_set (default: false): Whether to evaluate on the entire training set during evaluation. By default, training metrics will be computed at the end of each training step, and accumulated up to the evaluation phase. In practice, computing training set metrics during training is up to 30% faster than running a separate evaluation pass over the training set, but results in noisier training metrics, particularly during the earlier epochs. It's recommended to only set this to true if you need very exact training set metrics and are willing to pay a significant performance penalty for them. Options: true, false.
min_sum_hessian_in_leaf (default: 0.001): Minimum sum of hessians in a leaf with GBM trainer.
max_delta_step (default: 0.0): Used to limit the max output of tree leaves in the GBM trainer. A negative value means no constraint.
min_gain_to_split (default: 0.03): Minimum gain to split a leaf in the GBM trainer.
xgboost_dart_mode (default: false): Whether to use xgboost dart mode in the GBM trainer. Used only with boosting_type 'dart'. Options: true, false.
top_rate (default: 0.2): The retain ratio of large gradient data in the GBM trainer. Used only with boosting_type 'goss'.
other_rate (default: 0.1): The retain ratio of small gradient data in the GBM trainer. Used only with boosting_type 'goss'.
min_data_per_group (default: 100): Minimum number of data points per categorical group for the GBM trainer.
max_cat_threshold (default: 32): Number of split points considered for categorical features for the GBM trainer.
cat_l2 (default: 10.0): L2 regularization factor for categorical split in the GBM trainer.
cat_smooth (default: 10.0): Smoothing factor for categorical split in the GBM trainer.
max_cat_to_onehot (default: 4): Maximum categorical cardinality required before one-hot encoding in the GBM trainer.
cegb_tradeoff (default: 1.0): Cost-effective gradient boosting multiplier for all penalties in the GBM trainer.
cegb_penalty_split (default: 0.0): Cost-effective gradient boosting penalty for splitting a node in the GBM trainer.
path_smooth (default: 0.0): Smoothing factor applied to tree nodes in the GBM trainer.
verbose (default: -1): Verbosity level for GBM trainer. Options: -1, 0, 1, 2.
max_bin (default: 255): Maximum number of bins to use for discretizing features with GBM trainer.
feature_pre_filter (default: true): Whether to ignore features that are unsplittable based on min_data_in_leaf in the GBM trainer. Options: true, false.
skip_all_evaluation (default: false): Whether to skip evaluation entirely. If you are training a model with a well-known configuration on a well-known dataset and are confident about the expected results, you might skip all evaluation. Moreover, evaluating a model, especially on large validation or test sets, can be time-consuming. Options: true, false.
enable_profiling (default: false): Whether to enable profiling of the training process using torch.profiler.profile. Options: true, false.
profiler: Parameter values for profiling config.
profiler.wait (default: 1): The number of steps to wait before profiling.
profiler.warmup (default: 1): The number of steps for profiler warmup after waiting finishes.
profiler.active (default: 3): The number of steps that are actively recorded. Values greater than 10 will dramatically slow down TensorBoard loading.
profiler.repeat (default: 5): The optional number of profiling cycles. Use 0 to profile the entire training run.
profiler.skip_first (default: 0): The number of steps to skip at the beginning of training.
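To see how the profiler settings above divide training steps into phases, here is a small stand-alone sketch. It assumes the cycle structure that torch.profiler.schedule documents (skip_first steps, then repeated wait, warmup, and active phases); the function and its string labels are hypothetical and exist only for illustration.

```python
def profiler_phase(step, wait=1, warmup=1, active=3, repeat=5, skip_first=0):
    """Return the phase a training step falls into under the assumed schedule."""
    if step < skip_first:
        return "skip"
    step -= skip_first
    cycle_length = wait + warmup + active
    # repeat == 0 means the cycles continue for the entire training run.
    if repeat > 0 and step // cycle_length >= repeat:
        return "done"
    position = step % cycle_length
    if position < wait:
        return "wait"
    if position < wait + warmup:
        return "warmup"
    return "active"


if __name__ == "__main__":
    # Defaults from this page: wait=1, warmup=1, active=3, repeat=5, skip_first=0.
    for step in range(7):
        print(step, profiler_phase(step))
```

With the defaults, each cycle spans 5 steps, so the 5 cycles cover the first 25 training steps and nothing is recorded afterwards.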
Optimizer parameters
The available optimizers wrap the ones available in PyTorch. For details about the parameters that can be used to configure different optimizers, please refer to the PyTorch documentation.
The learning_rate parameter used by the optimizer comes from the trainer section.
Other optimizer-specific parameters, shown with their Ludwig default settings, follow:
sgd
optimizer:
    type: sgd
    momentum: 0.0
    weight_decay: 0.0
    dampening: 0.0
    nesterov: false
momentum (default: 0.0): Momentum factor.
weight_decay (default: 0.0): Weight decay (L2 penalty).
dampening (default: 0.0): Dampening for momentum.
nesterov (default: false): Enables Nesterov momentum. Options: true, false.
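As a rough illustration of how momentum, weight_decay, dampening, and nesterov interact, the sketch below applies one SGD update to a single scalar parameter, following the update rule given in the PyTorch documentation for torch.optim.SGD. The function name and the scalar setting are simplifications for illustration; the real optimizer operates on tensors.

```python
def sgd_step(param, grad, buf, lr, momentum=0.0, weight_decay=0.0,
             dampening=0.0, nesterov=False):
    """One SGD update for a scalar parameter; returns (new_param, new_buf)."""
    if weight_decay != 0.0:
        grad = grad + weight_decay * param  # L2 penalty folded into the gradient
    if momentum != 0.0:
        if buf is None:
            buf = grad  # first step: the momentum buffer starts as the gradient
        else:
            buf = momentum * buf + (1.0 - dampening) * grad
        grad = grad + momentum * buf if nesterov else buf
    return param - lr * grad, buf


if __name__ == "__main__":
    param, buf = 1.0, None
    for _ in range(3):
        param, buf = sgd_step(param, grad=0.5, buf=buf, lr=0.1, momentum=0.9)
        print(param, buf)
```

With the Ludwig defaults (momentum 0.0, weight_decay 0.0), the update reduces to plain gradient descent: param - lr * grad.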
sgd_8bit
optimizer:
    type: sgd_8bit
    momentum: 0.0
    weight_decay: 0.0
    dampening: 0.0
    nesterov: false
    block_wise: false
    percentile_clipping: 100
momentum (default: 0.0): Momentum factor.
weight_decay (default: 0.0): Weight decay (L2 penalty).
dampening (default: 0.0): Dampening for momentum.
nesterov (default: false): Enables Nesterov momentum. Options: true, false.
block_wise (default: false): Whether to use block wise update. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
lbfgs¶
optimizer:
    type: lbfgs
    max_iter: 20
    max_eval: null
    tolerance_grad: 1.0e-07
    tolerance_change: 1.0e-09
    history_size: 100
    line_search_fn: null
max_iter (default: 20): Maximum number of iterations per optimization step.
max_eval (default: null): Maximum number of function evaluations per optimization step. Default: max_iter * 1.25.
tolerance_grad (default: 1e-07): Termination tolerance on first order optimality.
tolerance_change (default: 1e-09): Termination tolerance on function value/parameter changes.
history_size (default: 100): Update history size.
line_search_fn (default: null): Line search function to use. Options: strong_wolfe, null.
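As an illustration, the sketch below enables the strong Wolfe line search and leaves max_eval unset, in which case it falls back to max_iter * 1.25 (25 evaluations for max_iter: 20). The choice of line search here is only an example.

```yaml
optimizer:
    type: lbfgs
    max_iter: 20
    max_eval: null                 # falls back to max_iter * 1.25 = 25
    line_search_fn: strong_wolfe   # the only non-null option listed above
```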
adam¶
optimizer:
    type: adam
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay (L2 penalty).
amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'. Options: true, false.
adam_8bit¶
optimizer:
    type: adam_8bit
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    block_wise: true
    percentile_clipping: 100
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay (L2 penalty).
amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'. Options: true, false.
block_wise (default: true): Whether to use block wise update. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
paged_adam¶
optimizer:
    type: paged_adam
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    block_wise: true
    percentile_clipping: 100
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay (L2 penalty).
amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'. Options: true, false.
block_wise (default: true): Whether to use block wise update. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
paged_adam_8bit¶
optimizer:
    type: paged_adam_8bit
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    block_wise: true
    percentile_clipping: 100
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay (L2 penalty).
amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'. Options: true, false.
block_wise (default: true): Whether to use block wise update. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
adamw¶
optimizer:
    type: adamw
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay ($L2$ penalty).
amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'. Options: true, false.
adamw_8bit¶
optimizer:
    type: adamw_8bit
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    block_wise: true
    percentile_clipping: 100
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay ($L2$ penalty).
amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'. Options: true, false.
block_wise (default: true): Whether to use block wise update. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
paged_adamw¶
optimizer:
    type: paged_adamw
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    block_wise: true
    percentile_clipping: 100
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay ($L2$ penalty).
amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'. Options: true, false.
block_wise (default: true): Whether to use block wise update. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
paged_adamw_8bit¶
optimizer:
    type: paged_adamw_8bit
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    block_wise: true
    percentile_clipping: 100
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay ($L2$ penalty).
amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'. Options: true, false.
block_wise (default: true): Whether to use block wise update. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
adadelta¶
optimizer:
    type: adadelta
    rho: 0.9
    eps: 1.0e-06
    weight_decay: 0.0
rho (default: 0.9): Coefficient used for computing a running average of squared gradients.
eps (default: 1e-06): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay ($L2$ penalty).
adagrad¶
optimizer:
    type: adagrad
    initial_accumulator_value: 0
    lr_decay: 0
    weight_decay: 0
    eps: 1.0e-10
initial_accumulator_value (default: 0)
lr_decay (default: 0): Learning rate decay.
weight_decay (default: 0): Weight decay ($L2$ penalty).
eps (default: 1e-10): Term added to the denominator to improve numerical stability.
adagrad_8bit¶
optimizer:
    type: adagrad_8bit
    initial_accumulator_value: 0
    lr_decay: 0
    weight_decay: 0
    eps: 1.0e-10
    block_wise: true
    percentile_clipping: 100
initial_accumulator_value (default: 0)
lr_decay (default: 0): Learning rate decay.
weight_decay (default: 0): Weight decay ($L2$ penalty).
eps (default: 1e-10): Term added to the denominator to improve numerical stability.
block_wise (default: true): Whether to use block wise update. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
adamax¶
optimizer:
    type: adamax
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay ($L2$ penalty).
nadam¶
optimizer:
    type: nadam
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    momentum_decay: 0.004
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay ($L2$ penalty).
momentum_decay (default: 0.004): Momentum decay.
rmsprop¶
optimizer:
    type: rmsprop
    momentum: 0.0
    alpha: 0.99
    eps: 1.0e-08
    centered: false
    weight_decay: 0.0
momentum (default: 0.0): Momentum factor.
alpha (default: 0.99): Smoothing constant.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
centered (default: false): If True, computes the centered RMSProp, and the gradient is normalized by an estimation of its variance. Options: true, false.
weight_decay (default: 0.0): Weight decay ($L2$ penalty).
rmsprop_8bit¶
optimizer:
    type: rmsprop_8bit
    momentum: 0.0
    alpha: 0.99
    eps: 1.0e-08
    centered: false
    weight_decay: 0.0
    block_wise: true
    percentile_clipping: 100
momentum (default: 0.0): Momentum factor.
alpha (default: 0.99): Smoothing constant.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
centered (default: false): If True, computes the centered RMSProp, and the gradient is normalized by an estimation of its variance. Options: true, false.
weight_decay (default: 0.0): Weight decay ($L2$ penalty).
block_wise (default: true): Whether to use block wise update. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
lamb¶
optimizer:
    type: lamb
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    bias_correction: true
    adam_w_mode: true
    percentile_clipping: 100
    block_wise: false
    max_unorm: 1.0
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay (L2 penalty).
amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'. Options: true, false.
bias_correction (default: true): Options: true, false.
adam_w_mode (default: true): Whether to use the AdamW mode of this algorithm from the paper 'Decoupled Weight Decay Regularization'. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
block_wise (default: false): Whether to use block wise update. Options: true, false.
max_unorm (default: 1.0)
lamb_8bit¶
optimizer:
    type: lamb_8bit
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    bias_correction: true
    adam_w_mode: true
    percentile_clipping: 100
    block_wise: false
    max_unorm: 1.0
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
eps (default: 1e-08): Term added to the denominator to improve numerical stability.
weight_decay (default: 0.0): Weight decay (L2 penalty).
amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'. Options: true, false.
bias_correction (default: true): Options: true, false.
adam_w_mode (default: true): Whether to use the AdamW mode of this algorithm from the paper 'Decoupled Weight Decay Regularization'. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
block_wise (default: false): Whether to use block wise update. Options: true, false.
max_unorm (default: 1.0)
lars¶
optimizer:
    type: lars
    momentum: 0.9
    dampening: 0.0
    weight_decay: 0.0
    nesterov: false
    percentile_clipping: 100
    max_unorm: 1.0
momentum (default: 0.9): Momentum factor.
dampening (default: 0.0): Dampening for momentum.
weight_decay (default: 0.0): Weight decay (L2 penalty).
nesterov (default: false): Enables Nesterov momentum. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
max_unorm (default: 1.0)
lars_8bit¶
optimizer:
    type: lars_8bit
    momentum: 0.9
    dampening: 0.0
    weight_decay: 0.0
    nesterov: false
    percentile_clipping: 100
    max_unorm: 1.0
momentum (default: 0.9): Momentum factor.
dampening (default: 0.0): Dampening for momentum.
weight_decay (default: 0.0): Weight decay (L2 penalty).
nesterov (default: false): Enables Nesterov momentum. Options: true, false.
percentile_clipping (default: 100): Percentile clipping.
max_unorm (default: 1.0)
lion¶
optimizer:
    type: lion
    betas:
    - 0.9
    - 0.999
    weight_decay: 0.0
    percentile_clipping: 100
    block_wise: true
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
weight_decay (default: 0.0): Weight decay (L2 penalty).
percentile_clipping (default: 100): Percentile clipping.
block_wise (default: true): Whether to use block wise update. Options: true, false.
lion_8bit¶
optimizer:
    type: lion_8bit
    betas:
    - 0.9
    - 0.999
    weight_decay: 0.0
    percentile_clipping: 100
    block_wise: true
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
weight_decay (default: 0.0): Weight decay (L2 penalty).
percentile_clipping (default: 100): Percentile clipping.
block_wise (default: true): Whether to use block wise update. Options: true, false.
paged_lion¶
optimizer:
    type: paged_lion
    betas:
    - 0.9
    - 0.999
    weight_decay: 0.0
    percentile_clipping: 100
    block_wise: true
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
weight_decay (default: 0.0): Weight decay (L2 penalty).
percentile_clipping (default: 100): Percentile clipping.
block_wise (default: true): Whether to use block wise update. Options: true, false.
paged_lion_8bit¶
optimizer:
    type: paged_lion_8bit
    betas:
    - 0.9
    - 0.999
    weight_decay: 0.0
    percentile_clipping: 100
    block_wise: true
betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
weight_decay (default: 0.0): Weight decay (L2 penalty).
percentile_clipping (default: 100): Percentile clipping.
block_wise (default: true): Whether to use block wise update. Options: true, false.
Note
Gradient clipping is also configurable, through the gradient_clipping section of the trainer, with the following parameters:
gradient_clipping:
    clipglobalnorm: 0.5
    clipnorm: null
    clipvalue: null
No optimizer parameters are available for the LightGBM trainer.
Training length¶
The length of the training process is configured by:
epochs (default: 100): One epoch is one pass through the entire dataset. By default, epochs is 100, which means that the training process will run for a maximum of 100 epochs before terminating.
train_steps (default: None): The maximum number of steps to train for, using one mini-batch per step. By default this is unset, and epochs will be used to determine training length.
num_boost_round (default: 100): The number of boosting iterations. By default, num_boost_round is 100, which means that the training process will run for a maximum of 100 boosting iterations before terminating.
Tip
In general, it's a good idea to set up a long training runway, relying on early stopping criteria (early_stop) to stop training when there hasn't been any improvement for a long time.
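A long runway with early stopping might be configured as in the sketch below. Both values are illustrative assumptions rather than recommended settings.

```yaml
trainer:
    epochs: 1000     # generous upper bound on training length
    early_stop: 10   # stop after 10 evaluation rounds without improvement
```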
Early stopping¶
Machine learning models, when trained for too long, are often prone to overfitting. It's generally good policy to set up early stopping criteria: there is little value in continuing to train a model after it has learned what it can, and stopping at that point helps it retain its ability to generalize to new data.
How early stopping works in Ludwig¶
By default, Ludwig sets trainer.early_stop=5, which means that if there have been 5 consecutive rounds of evaluation where there hasn't been any improvement on the validation subset, then training will terminate.
Ludwig runs evaluation once per checkpoint, which by default is once per epoch. Checkpoint frequency can be configured using checkpoints_per_epoch (default: 0, disabled) or steps_per_checkpoint (default: 0, disabled). See this section for more details.
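To make the interaction concrete, here is a sketch with illustrative values. With evaluation every 1000 training steps and early_stop set to 5, training terminates once 5 consecutive evaluations (5000 training steps) pass without improvement on the validation subset.

```yaml
trainer:
    steps_per_checkpoint: 1000   # checkpoint and evaluate every 1000 training steps
    early_stop: 5                # 5 evaluations = 5000 steps without improvement
```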
Changing the early stopping metric¶
The metric that dictates early stopping is determined by trainer.validation_field and trainer.validation_metric. By default, early stopping uses the combined loss on the validation subset.
trainer:
    validation_field: combined
    validation_metric: loss
However, this can be configured to use other metrics. For example, if we had an output feature called recommended, then we can configure early stopping on the output feature accuracy like so:
trainer:
    validation_field: recommended
    validation_metric: accuracy
Disabling early stopping¶
trainer.early_stop can be set to -1, which disables early stopping entirely.
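A minimal sketch, assuming train_steps is left unset so that epochs alone determines training length (the epoch count is an illustrative value):

```yaml
trainer:
    epochs: 20
    early_stop: -1   # early stopping disabled; training runs for all 20 epochs
```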
Checkpoint-evaluation frequency¶
Evaluation is run every time the model is checkpointed.
By default, checkpoint-evaluation will occur once every epoch.
The frequency of checkpoint-evaluation can be configured using:
steps_per_checkpoint (default: 0): every n training steps
checkpoints_per_epoch (default: 0): n times per epoch
Note
It is invalid to specify both non-zero steps_per_checkpoint and non-zero checkpoints_per_epoch.
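For example, a sub-epoch evaluation schedule could be sketched as follows. The value 4 is an illustrative choice, and steps_per_checkpoint is left at 0 so that only one of the two settings is non-zero.

```yaml
trainer:
    checkpoints_per_epoch: 4   # checkpoint and evaluate 4 times per epoch
    steps_per_checkpoint: 0    # keep at 0; setting both to non-zero is invalid
```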
Tip
Running evaluation once per epoch is an appropriate fit for small datasets that fit in memory and train quickly. However, this can be a poor fit for unstructured datasets, which tend to be much larger, and train more slowly due to larger models.
Running evaluation too frequently can be wasteful while running evaluation not frequently enough can be uninformative. In large-scale training runs, it's common for evaluation to be configured to run on a sub-epoch time scale, or every few thousand steps.
We recommend configuring evaluation such that new evaluation results are available at least several times an hour. In general, it is not necessary for models to train over the entirety of a dataset, nor evaluate over the entirety of a test set, to produce useful monitoring metrics and signals to indicate model performance.
Increasing throughput on GPUs¶
Increase batch size¶
trainer:
    batch_size: auto
Users training on GPUs can often increase training throughput by increasing the batch_size so that more examples are computed every training step. Set batch_size to auto to use the largest batch size that can fit in memory.
Use mixed precision¶
trainer:
    use_mixed_precision: true
Speeds up training by using float16 parameters where it makes sense. Mixed precision training on GPU can dramatically speed up training, with some risk to model convergence. In practice, it works particularly well when fine-tuning a pretrained model like a HuggingFace transformer. See blog here for more details.