Trainer

Overview

The trainer section of the configuration lets you specify parameters that configure the training process, like the number of epochs or the learning rate. By default, the ECD trainer is used.

trainer:
    early_stop: 5
    learning_rate: 0.001
    epochs: 100
    batch_size: auto
    regularization_type: l2
    use_mixed_precision: false
    compile: false
    checkpoints_per_epoch: 0
    eval_steps: null
    effective_batch_size: auto
    gradient_accumulation_steps: auto
    regularization_lambda: 0.0
    enable_gradient_checkpointing: false
    validation_field: null
    validation_metric: null
    train_steps: null
    steps_per_checkpoint: 0
    max_batch_size: 1099511627776
    eval_batch_size: null
    evaluate_training_set: false
    should_shuffle: true
    increase_batch_size_on_plateau: 0
    increase_batch_size_on_plateau_patience: 5
    increase_batch_size_on_plateau_rate: 2.0
    increase_batch_size_eval_metric: loss
    increase_batch_size_eval_split: training
    learning_rate_scaling: linear
    bucketing_field: null
    skip_all_evaluation: false
    enable_profiling: false
    profiler:
        wait: 1
        warmup: 1
        active: 3
        repeat: 5
        skip_first: 0
    learning_rate_scheduler:
        decay: null
        decay_rate: 0.96
        decay_steps: 10000
        staircase: false
        reduce_on_plateau: 0
        reduce_on_plateau_patience: 10
        reduce_on_plateau_rate: 0.1
        warmup_evaluations: 0
        warmup_fraction: 0.0
        reduce_eval_metric: loss
        reduce_eval_split: training
        t_0: null
        t_mult: 1
        eta_min: 0
        max_lr: null
        pct_start: 0.3
        div_factor: 25.0
        final_div_factor: 10000.0
        inverse_sqrt_warmup_steps: 4000
        polynomial_power: 1.0
        polynomial_end_lr: 0.0
        wsd_warmup_fraction: 0.1
        wsd_stable_fraction: 0.8
        wsd_decay_fraction: 0.1
    optimizer:
        type: adam
        betas:
        - 0.9
        - 0.999
        eps: 1.0e-08
        weight_decay: 0.0
        amsgrad: false
    gradient_clipping:
        clipglobalnorm: 0.5
        clipnorm: null
        clipvalue: null
    layers_to_freeze_regex: null
    loss_balancing: none
    loss_balancing_alpha: 1.5
    loss_balancing_lr: 0.01
    loss_balancing_preference_vector: null
    loss_balancing_tchebycheff_weight: 0.5
    contrastive_pretrain_epochs: 0
    contrastive_pretrain_temperature: 0.07
    contrastive_pretrain_projection_dim: 128
    contrastive_pretrain_learnable_temperature: true
    modality_dropout: 0.0
    model_soup: null
    model_soup_top_k: 5

With type: finetune (the trainer used for LLM fine-tuning), the defaults are:

trainer:
    type: finetune
    learning_rate: 0.001
    validation_field: null
    validation_metric: null
    early_stop: 5
    skip_all_evaluation: false
    enable_profiling: false
    profiler:
        wait: 1
        warmup: 1
        active: 3
        repeat: 5
        skip_first: 0
    learning_rate_scheduler:
        decay: null
        decay_rate: 0.96
        decay_steps: 10000
        staircase: false
        reduce_on_plateau: 0
        reduce_on_plateau_patience: 10
        reduce_on_plateau_rate: 0.1
        warmup_evaluations: 0
        warmup_fraction: 0.0
        reduce_eval_metric: loss
        reduce_eval_split: training
        t_0: null
        t_mult: 1
        eta_min: 0
        max_lr: null
        pct_start: 0.3
        div_factor: 25.0
        final_div_factor: 10000.0
        inverse_sqrt_warmup_steps: 4000
        polynomial_power: 1.0
        polynomial_end_lr: 0.0
        wsd_warmup_fraction: 0.1
        wsd_stable_fraction: 0.8
        wsd_decay_fraction: 0.1
    epochs: 100
    checkpoints_per_epoch: 0
    train_steps: null
    eval_steps: null
    steps_per_checkpoint: 0
    effective_batch_size: auto
    batch_size: 1
    max_batch_size: 1099511627776
    gradient_accumulation_steps: auto
    eval_batch_size: 2
    evaluate_training_set: false
    optimizer:
        type: adam
        betas:
        - 0.9
        - 0.999
        eps: 1.0e-08
        weight_decay: 0.0
        amsgrad: false
    regularization_type: l2
    regularization_lambda: 0.0
    should_shuffle: true
    increase_batch_size_on_plateau: 0
    increase_batch_size_on_plateau_patience: 5
    increase_batch_size_on_plateau_rate: 2.0
    increase_batch_size_eval_metric: loss
    increase_batch_size_eval_split: training
    gradient_clipping:
        clipglobalnorm: 0.5
        clipnorm: null
        clipvalue: null
    learning_rate_scaling: linear
    bucketing_field: null
    use_mixed_precision: false
    compile: false
    enable_gradient_checkpointing: false
    layers_to_freeze_regex: null
    loss_balancing: none
    loss_balancing_alpha: 1.5
    loss_balancing_lr: 0.01
    loss_balancing_preference_vector: null
    loss_balancing_tchebycheff_weight: 0.5
    contrastive_pretrain_epochs: 0
    contrastive_pretrain_temperature: 0.07
    contrastive_pretrain_projection_dim: 128
    contrastive_pretrain_learnable_temperature: true
    modality_dropout: 0.0
    model_soup: null
    model_soup_top_k: 5
    base_learning_rate: 0.0

Trainer parameters

  • early_stop (default: 5) : Number of consecutive rounds of evaluation without any improvement on the validation_metric that triggers training to stop. Can be set to -1, which disables early stopping entirely.
  • learning_rate (default: 0.001): Controls how much to change the model in response to the estimated error each time the model weights are updated. If 'auto', the optimal learning rate is estimated by choosing the learning rate that produces the smallest non-diverging gradient update.
  • epochs (default: 100): Number of epochs the algorithm is intended to be run over. Overridden if train_steps is set.
  • batch_size (default: auto): The number of training examples utilized in one training step of the model. If 'auto', the batch size that maximizes training throughput (samples / sec) will be used. For CPU training, the tuned batch size is capped at 128, as the throughput benefits of large batch sizes are less noticeable without a GPU.
  • regularization_type (default: l2) : Type of regularization. Options: l1, l2, l1_l2, null.
  • use_mixed_precision (default: false) : Enable automatic mixed-precision (AMP) during training.
  • compile (default: false) : Whether to compile the model before training.
  • checkpoints_per_epoch (default: 0): Number of checkpoints per epoch. For example, 2 -> checkpoints are written every half of an epoch. Note that it is invalid to specify both non-zero steps_per_checkpoint and non-zero checkpoints_per_epoch.
  • eval_steps (default: null): The number of steps to use for evaluation. If None, the entire evaluation set will be used.
  • effective_batch_size (default: auto): The effective batch size is the total number of samples used to compute a single gradient update to the model weights. This differs from batch_size by taking gradient_accumulation_steps and the number of training worker processes into account. In practice, effective_batch_size = batch_size * gradient_accumulation_steps * num_workers. If 'auto', the effective batch size is derived implicitly from batch_size. If set explicitly, then one of batch_size or gradient_accumulation_steps must be set to something other than 'auto', and the remaining value will be set following the formula given above.
  • gradient_accumulation_steps (default: auto): Number of steps to accumulate gradients over before performing a weight update.
  • regularization_lambda (default: 0.0): Strength of the regularization.
  • enable_gradient_checkpointing (default: false): Whether to enable gradient checkpointing, which trades compute for memory. This is useful for training very deep models with limited memory.
  • validation_field (default: null): The field for which the validation_metric is used for validation-related mechanics like early stopping, parameter change plateaus, as well as what hyperparameter optimization uses to determine the best trial. If unset (default), the first output feature is used. If explicitly specified, neither validation_field nor validation_metric are overwritten.
  • validation_metric (default: null): Metric from validation_field that is used. If validation_field is not explicitly specified, this is overwritten to be the first output feature type's default_validation_metric, consistent with validation_field. If the validation_metric is specified, then we will use the first output feature that produces this metric as the validation_field.
  • train_steps (default: null): Maximum number of training steps the algorithm is intended to be run over. Unset by default. If set, will override epochs and if left unset then epochs is used to determine training length.
  • steps_per_checkpoint (default: 0): How often the model is checkpointed. Also dictates maximum evaluation frequency. If 0 the model is checkpointed after every epoch.
  • max_batch_size (default: 1099511627776): Auto batch size tuning and increasing batch size on plateau will be capped at this value. The default value is 2^40.
  • eval_batch_size (default: null): Size of batch to pass to the model for evaluation. If it is 0 or None, the same value of batch_size is used. This is useful to speed up evaluation with a much bigger batch size than training, if enough memory is available. If 'auto', the biggest batch size (power of 2) that can fit in memory will be used.
  • evaluate_training_set (default: false): Whether to evaluate on the entire training set during evaluation. By default, training metrics will be computed at the end of each training step, and accumulated up to the evaluation phase. In practice, computing training set metrics during training is up to 30% faster than running a separate evaluation pass over the training set, but results in more noisy training metrics, particularly during the earlier epochs. It's recommended to only set this to True if you need very exact training set metrics, and are willing to pay a significant performance penalty for them.
  • should_shuffle (default: true): Whether to shuffle batches during training.
  • increase_batch_size_on_plateau (default: 0): The number of times to increase the batch size on a plateau.
  • increase_batch_size_on_plateau_patience (default: 5): How many epochs to wait for before increasing the batch size.
  • increase_batch_size_on_plateau_rate (default: 2.0): Rate at which the batch size increases.
  • increase_batch_size_eval_metric (default: loss): Which metric to listen on for increasing the batch size.
  • increase_batch_size_eval_split (default: training): Which dataset split to listen on for increasing the batch size.
  • learning_rate_scaling (default: linear): Scale by which to increase the learning rate as the number of distributed workers increases. Traditionally the learning rate is scaled linearly with the number of workers to reflect the proportion by which the effective batch size is increased. For very large batch sizes, a softer square-root scale can sometimes lead to better model performance. If the learning rate is hand-tuned for a given number of workers, setting this value to constant can be used to disable scale-up. Options: constant, sqrt, linear.
  • bucketing_field (default: null): Feature to use for bucketing datapoints.
  • skip_all_evaluation (default: false):
  • enable_profiling (default: false):
  • profiler (default: null):
  • profiler.wait (default: 1): The number of steps to wait before profiling.
  • profiler.warmup (default: 1): The number of steps for profiler warmup after waiting finishes.
  • profiler.active (default: 3): The number of steps that are actively recorded. Values greater than 10 will dramatically slow down TensorBoard loading.
  • profiler.repeat (default: 5): The optional number of profiling cycles. Use 0 to profile the entire training run.
  • profiler.skip_first (default: 0): The number of steps to skip in the beginning of training.
  • learning_rate_scheduler (default: null):
  • learning_rate_scheduler.decay (default: null): Learning rate decay schedule. Options: linear, exponential, cosine, one_cycle, inverse_sqrt, polynomial, wsd, null.
  • learning_rate_scheduler.decay_rate (default: 0.96): Factor by which to decrease the learning rate.
  • learning_rate_scheduler.decay_steps (default: 10000): The number of steps to take in the exponential learning rate decay.
  • learning_rate_scheduler.staircase (default: false): Decays the learning rate at discrete intervals.
  • learning_rate_scheduler.reduce_on_plateau (default: 0): How many times to reduce the learning rate when the algorithm hits a plateau (i.e. the performance on the training set does not improve).
  • learning_rate_scheduler.reduce_on_plateau_patience (default: 10): How many evaluation steps have to pass before the learning rate reduces when reduce_on_plateau > 0.
  • learning_rate_scheduler.reduce_on_plateau_rate (default: 0.1): Rate at which we reduce the learning rate when reduce_on_plateau > 0.
  • learning_rate_scheduler.warmup_evaluations (default: 0): Number of evaluation steps to warmup the learning rate for.
  • learning_rate_scheduler.warmup_fraction (default: 0.0): Fraction of total training steps to warmup the learning rate for.
  • learning_rate_scheduler.reduce_eval_metric (default: loss): Metric plateau used to trigger when we reduce the learning rate when reduce_on_plateau > 0.
  • learning_rate_scheduler.reduce_eval_split (default: training): Which dataset split to listen on for reducing the learning rate when reduce_on_plateau > 0.
  • learning_rate_scheduler.t_0 (default: null): Number of steps before the first restart for cosine annealing decay. If not specified, it will be set to steps_per_checkpoint.
  • learning_rate_scheduler.t_mult (default: 1): Period multiplier after each restart for cosine annealing decay. Defaults to 1, i.e., restart every t_0 steps. If set to a larger value, the period between restarts increases by that multiplier. For example, if t_mult is 2, the periods are t_0, 2 * t_0, 2^2 * t_0, 2^3 * t_0, etc.
  • learning_rate_scheduler.eta_min (default: 0): Minimum learning rate allowed for cosine annealing decay. Default: 0.
  • learning_rate_scheduler.max_lr (default: null): Maximum learning rate for the OneCycleLR scheduler. If None, defaults to the optimizer's base learning rate. Used only when decay='one_cycle'.
  • learning_rate_scheduler.pct_start (default: 0.3): Fraction of training steps spent increasing the learning rate in the OneCycleLR scheduler. Used only when decay='one_cycle'.
  • learning_rate_scheduler.div_factor (default: 25.0): Determines the initial learning rate (initial_lr = max_lr / div_factor) for OneCycleLR. Used only when decay='one_cycle'.
  • learning_rate_scheduler.final_div_factor (default: 10000.0): Determines the minimum learning rate (min_lr = initial_lr / final_div_factor) for OneCycleLR. Used only when decay='one_cycle'.
  • learning_rate_scheduler.inverse_sqrt_warmup_steps (default: 4000): Number of warmup steps for the inverse square root scheduler. After warmup, the LR decays as 1/sqrt(step). This is the classic Transformer schedule from Vaswani et al. (2017). Used only when decay='inverse_sqrt'.
  • learning_rate_scheduler.polynomial_power (default: 1.0): Power of the polynomial decay. power=1.0 gives linear decay; higher values give more concave decay curves. Used only when decay='polynomial'.
  • learning_rate_scheduler.polynomial_end_lr (default: 0.0): Final (minimum) learning rate at the end of polynomial decay. Used only when decay='polynomial'.
  • learning_rate_scheduler.wsd_warmup_fraction (default: 0.1): Fraction of total training steps spent in the linear warmup phase of the WSD scheduler. Used only when decay='wsd'.
  • learning_rate_scheduler.wsd_stable_fraction (default: 0.8): Fraction of total training steps spent in the constant LR phase of the WSD scheduler. Used only when decay='wsd'.
  • learning_rate_scheduler.wsd_decay_fraction (default: 0.1): Fraction of total training steps spent in the decay phase of the WSD scheduler. wsd_warmup_fraction + wsd_stable_fraction + wsd_decay_fraction should sum to 1. Used only when decay='wsd'.
  • optimizer (default: null): See Optimizer parameters for details.
  • gradient_clipping (default: null):
  • gradient_clipping.clipglobalnorm (default: 0.5): Maximum allowed norm of the gradients
  • gradient_clipping.clipnorm (default: null): Maximum allowed norm of the gradients
  • gradient_clipping.clipvalue (default: null): Maximum allowed value of the gradients
  • layers_to_freeze_regex (default: null): Freeze specific layers based on the provided regex. Freezing specific layers can improve a pretrained model's performance in a number of ways. At a basic level, freezing early layers can prevent overfitting by retaining more general features (beneficial for small datasets). It can also reduce computational resource use and lower overall training time, because fewer gradients are computed.
  • loss_balancing (default: none): Multi-task loss balancing strategy for models with multiple output features. 'none': static weighted sum (default). 'log_transform': log(1+loss) compression (DB-MTL). 'uncertainty': learnable homoscedastic uncertainty weighting (Kendall et al., CVPR 2018). 'famo': fast adaptive multitask optimization (Liu et al., NeurIPS 2023). 'gradnorm': gradient normalization (Chen et al., ICML 2018). 'nash_mtl': Nash bargaining solution for multi-task weighting (Navon et al., ICML 2022). 'pareto_mtl': Pareto-optimal multi-task learning with preference vectors (Lin et al., NeurIPS 2019). Options: none, log_transform, uncertainty, famo, gradnorm, nash_mtl, pareto_mtl.
  • loss_balancing_alpha (default: 1.5): Asymmetry parameter for gradnorm and smoothing for famo loss balancing.
  • loss_balancing_lr (default: 0.01): Learning rate for famo loss balancing weight updates.
  • loss_balancing_preference_vector (default: null): Preference vector used by loss_balancing: pareto_mtl. One entry per output feature (in the order they appear in output_features), non-negative, normalised internally to sum to 1. Training is steered toward the Pareto-optimal point where the task losses are inversely proportional to this vector. When null, a uniform preference is used.
  • loss_balancing_tchebycheff_weight (default: 0.5): Mixing weight for pareto_mtl between the linear-scalarised term (weight = 1 - tchebycheff_weight) and the Tchebycheff max term (weight = tchebycheff_weight). Pure Tchebycheff (1.0) enforces exact preference adherence but is rough to train; pure linear (0.0) trains smoothly but diverges from the exact preference. Default 0.5 matches Mahapatra & Rajan's 'mixed-exact' scalarisation (ICML 2020).
  • contrastive_pretrain_epochs (default: 0): Number of epochs of contrastive pre-alignment between per-feature encoders to run before the main training loop. 0 disables pre-alignment (default). A brief warmup (1-3 epochs) is usually enough to pull encoder output spaces into alignment so the downstream combiner sees already-comparable representations. Inspired by CLIP-style alignment (Radford et al., ICML 2021) adapted to Ludwig's multi-encoder ECD architecture.
  • contrastive_pretrain_temperature (default: 0.07): Initial InfoNCE temperature for contrastive pre-alignment. Lower values sharpen the softmax. 0.07 matches CLIP's initial value.
  • contrastive_pretrain_projection_dim (default: 128): Width of the shared projection space used during contrastive pre-alignment. The per-feature projection heads are discarded after pre-alignment; only the updated encoder weights carry forward into the main training loop.
  • contrastive_pretrain_learnable_temperature (default: true): When True (default), the InfoNCE log-temperature is a trainable parameter following the CLIP convention. Set to False to fix the temperature at contrastive_pretrain_temperature throughout pre-alignment.
  • modality_dropout (default: 0.0): Probability of dropping each input feature's encoder output during training. Dropped features are replaced with learnable missing-modality embeddings. Set to 0.0 to disable (default). Improves robustness to missing inputs at inference.
  • model_soup (default: null): Model soup strategy for averaging top-K checkpoint weights after training. 'uniform': average all top-K checkpoints. 'greedy': greedily add checkpoints that improve validation metric. None to disable (default). Options: uniform, greedy, null.
  • model_soup_top_k (default: 5): Number of top checkpoints to keep for model soup.

The finetune trainer accepts the following parameters:

  • type (default: finetune): Options: finetune.
  • learning_rate (default: 0.001): Controls how much to change the model in response to the estimated error each time the model weights are updated. If 'auto', the optimal learning rate is estimated by choosing the learning rate that produces the smallest non-diverging gradient update.
  • validation_field (default: null):
  • validation_metric (default: null):
  • early_stop (default: 5):
  • skip_all_evaluation (default: false):
  • enable_profiling (default: false):
  • profiler (default: null):
  • profiler.wait (default: 1): The number of steps to wait before profiling.
  • profiler.warmup (default: 1): The number of steps for profiler warmup after waiting finishes.
  • profiler.active (default: 3): The number of steps that are actively recorded. Values greater than 10 will dramatically slow down TensorBoard loading.
  • profiler.repeat (default: 5): The optional number of profiling cycles. Use 0 to profile the entire training run.
  • profiler.skip_first (default: 0): The number of steps to skip in the beginning of training.
  • learning_rate_scheduler (default: null):
  • learning_rate_scheduler.decay (default: null): Learning rate decay schedule. Options: linear, exponential, cosine, one_cycle, inverse_sqrt, polynomial, wsd, null.
  • learning_rate_scheduler.decay_rate (default: 0.96): Factor by which to decrease the learning rate.
  • learning_rate_scheduler.decay_steps (default: 10000): The number of steps to take in the exponential learning rate decay.
  • learning_rate_scheduler.staircase (default: false): Decays the learning rate at discrete intervals.
  • learning_rate_scheduler.reduce_on_plateau (default: 0): How many times to reduce the learning rate when the algorithm hits a plateau (i.e. the performance on the training set does not improve).
  • learning_rate_scheduler.reduce_on_plateau_patience (default: 10): How many evaluation steps have to pass before the learning rate reduces when reduce_on_plateau > 0.
  • learning_rate_scheduler.reduce_on_plateau_rate (default: 0.1): Rate at which we reduce the learning rate when reduce_on_plateau > 0.
  • learning_rate_scheduler.warmup_evaluations (default: 0): Number of evaluation steps to warmup the learning rate for.
  • learning_rate_scheduler.warmup_fraction (default: 0.0): Fraction of total training steps to warmup the learning rate for.
  • learning_rate_scheduler.reduce_eval_metric (default: loss): Metric plateau used to trigger when we reduce the learning rate when reduce_on_plateau > 0.
  • learning_rate_scheduler.reduce_eval_split (default: training): Which dataset split to listen on for reducing the learning rate when reduce_on_plateau > 0.
  • learning_rate_scheduler.t_0 (default: null): Number of steps before the first restart for cosine annealing decay. If not specified, it will be set to steps_per_checkpoint.
  • learning_rate_scheduler.t_mult (default: 1): Period multiplier after each restart for cosine annealing decay. Defaults to 1, i.e., restart every t_0 steps. If set to a larger value, the period between restarts increases by that multiplier. For example, if t_mult is 2, the periods are t_0, 2 * t_0, 2^2 * t_0, 2^3 * t_0, etc.
  • learning_rate_scheduler.eta_min (default: 0): Minimum learning rate allowed for cosine annealing decay. Default: 0.
  • learning_rate_scheduler.max_lr (default: null): Maximum learning rate for the OneCycleLR scheduler. If None, defaults to the optimizer's base learning rate. Used only when decay='one_cycle'.
  • learning_rate_scheduler.pct_start (default: 0.3): Fraction of training steps spent increasing the learning rate in the OneCycleLR scheduler. Used only when decay='one_cycle'.
  • learning_rate_scheduler.div_factor (default: 25.0): Determines the initial learning rate (initial_lr = max_lr / div_factor) for OneCycleLR. Used only when decay='one_cycle'.
  • learning_rate_scheduler.final_div_factor (default: 10000.0): Determines the minimum learning rate (min_lr = initial_lr / final_div_factor) for OneCycleLR. Used only when decay='one_cycle'.
  • learning_rate_scheduler.inverse_sqrt_warmup_steps (default: 4000): Number of warmup steps for the inverse square root scheduler. After warmup, the LR decays as 1/sqrt(step). This is the classic Transformer schedule from Vaswani et al. (2017). Used only when decay='inverse_sqrt'.
  • learning_rate_scheduler.polynomial_power (default: 1.0): Power of the polynomial decay. power=1.0 gives linear decay; higher values give more concave decay curves. Used only when decay='polynomial'.
  • learning_rate_scheduler.polynomial_end_lr (default: 0.0): Final (minimum) learning rate at the end of polynomial decay. Used only when decay='polynomial'.
  • learning_rate_scheduler.wsd_warmup_fraction (default: 0.1): Fraction of total training steps spent in the linear warmup phase of the WSD scheduler. Used only when decay='wsd'.
  • learning_rate_scheduler.wsd_stable_fraction (default: 0.8): Fraction of total training steps spent in the constant LR phase of the WSD scheduler. Used only when decay='wsd'.
  • learning_rate_scheduler.wsd_decay_fraction (default: 0.1): Fraction of total training steps spent in the decay phase of the WSD scheduler. wsd_warmup_fraction + wsd_stable_fraction + wsd_decay_fraction should sum to 1. Used only when decay='wsd'.
  • epochs (default: 100):
  • checkpoints_per_epoch (default: 0):
  • train_steps (default: null):
  • eval_steps (default: null):
  • steps_per_checkpoint (default: 0):
  • effective_batch_size (default: auto):
  • batch_size (default: 1): The number of training examples utilized in one training step of the model. If 'auto', the batch size that maximizes training throughput (samples / sec) will be used.
  • max_batch_size (default: 1099511627776):
  • gradient_accumulation_steps (default: auto):
  • eval_batch_size (default: 2): Size of batch to pass to the model for evaluation. If it is 0 or None, the same value of batch_size is used. This is useful to speed up evaluation with a much bigger batch size than training, if enough memory is available. If 'auto', the biggest batch size (power of 2) that can fit in memory will be used.
  • evaluate_training_set (default: false):
  • optimizer (default: null): See Optimizer parameters for details.
  • regularization_type (default: l2):
  • regularization_lambda (default: 0.0):
  • should_shuffle (default: true):
  • increase_batch_size_on_plateau (default: 0):
  • increase_batch_size_on_plateau_patience (default: 5):
  • increase_batch_size_on_plateau_rate (default: 2.0):
  • increase_batch_size_eval_metric (default: loss):
  • increase_batch_size_eval_split (default: training):
  • gradient_clipping (default: null):
  • gradient_clipping.clipglobalnorm (default: 0.5): Maximum allowed norm of the gradients
  • gradient_clipping.clipnorm (default: null): Maximum allowed norm of the gradients
  • gradient_clipping.clipvalue (default: null): Maximum allowed value of the gradients
  • learning_rate_scaling (default: linear):
  • bucketing_field (default: null):
  • use_mixed_precision (default: false):
  • compile (default: false):
  • enable_gradient_checkpointing (default: false):
  • layers_to_freeze_regex (default: null):
  • loss_balancing (default: none):
  • loss_balancing_alpha (default: 1.5):
  • loss_balancing_lr (default: 0.01):
  • loss_balancing_preference_vector (default: null):
  • loss_balancing_tchebycheff_weight (default: 0.5):
  • contrastive_pretrain_epochs (default: 0):
  • contrastive_pretrain_temperature (default: 0.07):
  • contrastive_pretrain_projection_dim (default: 128):
  • contrastive_pretrain_learnable_temperature (default: true):
  • modality_dropout (default: 0.0):
  • model_soup (default: null):
  • model_soup_top_k (default: 5):
  • base_learning_rate (default: 0.0): Base learning rate used for training in the LLM trainer.
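
The effective_batch_size formula given in the parameter list can be checked with a few lines of Python. This is only an illustrative sketch of the arithmetic, not Ludwig's implementation: the helper name resolve_gradient_accumulation_steps and the decision to raise an error on non-divisible values are assumptions made for the example.

```python
def effective_batch_size(batch_size: int, gradient_accumulation_steps: int, num_workers: int) -> int:
    """Samples contributing to one weight update, per the documented formula."""
    return batch_size * gradient_accumulation_steps * num_workers


def resolve_gradient_accumulation_steps(effective: int, batch_size: int, num_workers: int) -> int:
    """Hypothetical helper: solve the same formula for gradient_accumulation_steps."""
    samples_per_step = batch_size * num_workers
    if effective % samples_per_step != 0:
        # Assumption for this sketch: reject values that do not divide evenly.
        raise ValueError("effective batch size must be a multiple of batch_size * num_workers")
    return effective // samples_per_step


print(effective_batch_size(batch_size=16, gradient_accumulation_steps=4, num_workers=2))  # 128
print(resolve_gradient_accumulation_steps(effective=128, batch_size=16, num_workers=2))  # 4
```

With batch_size 16 on 2 workers, an explicit effective batch size of 128 therefore corresponds to 4 accumulation steps.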
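
Two of the learning_rate_scheduler descriptions reduce to simple arithmetic: the restart periods controlled by t_0 and t_mult, and the one_cycle learning rate bounds derived from max_lr, div_factor, and final_div_factor. The sketch below works through both. The function names are hypothetical and exist only for this example; the code mirrors the formulas quoted in the parameter list rather than any scheduler internals.

```python
from typing import List, Tuple


def cosine_restart_periods(t_0: int, t_mult: int, num_cycles: int) -> List[int]:
    """Length in steps of each cosine annealing cycle: t_0, t_0 * t_mult, t_0 * t_mult**2, ..."""
    return [t_0 * t_mult**i for i in range(num_cycles)]


def one_cycle_bounds(
    max_lr: float, div_factor: float = 25.0, final_div_factor: float = 10000.0
) -> Tuple[float, float]:
    """Initial and minimum learning rate implied by the one_cycle formulas."""
    initial_lr = max_lr / div_factor           # initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor     # min_lr = initial_lr / final_div_factor
    return initial_lr, min_lr


print(cosine_restart_periods(t_0=100, t_mult=2, num_cycles=4))  # [100, 200, 400, 800]
initial_lr, min_lr = one_cycle_bounds(max_lr=0.025)
print(initial_lr, min_lr)
```

With the default div_factor and final_div_factor, the minimum learning rate ends up 250,000 times smaller than max_lr.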

Optimizer parameters

The available optimizers wrap the ones available in PyTorch. For details about the parameters that can be used to configure different optimizers, please refer to the PyTorch documentation.

The learning_rate parameter used by the optimizer comes from the trainer section. Other optimizer-specific parameters, shown with their Ludwig default settings, follow:

sgd

optimizer:
    type: sgd
    momentum: 0.0
    weight_decay: 0.0
    dampening: 0.0
    nesterov: false
  • momentum (default: 0.0): Momentum factor.
  • weight_decay (default: 0.0): Weight decay (L2 penalty).
  • dampening (default: 0.0): Dampening for momentum.
  • nesterov (default: false): Enables Nesterov momentum.
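
To make the roles of momentum, dampening, nesterov, and weight_decay concrete, the sketch below applies the update to a single scalar parameter in plain Python. It follows the update rule documented for PyTorch's torch.optim.SGD, on the assumption that Ludwig's sgd optimizer wraps it unchanged. The function name sgd_step and the scalar setting are illustrative only; the real optimizer operates on tensors.

```python
from typing import Optional, Tuple


def sgd_step(
    param: float,
    grad: float,
    buf: Optional[float],
    lr: float,
    momentum: float = 0.0,
    weight_decay: float = 0.0,
    dampening: float = 0.0,
    nesterov: bool = False,
) -> Tuple[float, Optional[float]]:
    """Apply one SGD update to a scalar; returns (new_param, new_momentum_buffer)."""
    # weight_decay adds an L2 penalty term to the gradient.
    grad = grad + weight_decay * param
    if momentum != 0.0:
        # The buffer starts as the first gradient; afterwards dampening scales new gradients.
        buf = grad if buf is None else momentum * buf + (1.0 - dampening) * grad
        # Nesterov momentum adds a look-ahead along the buffer direction.
        grad = grad + momentum * buf if nesterov else buf
    return param - lr * grad, buf


param, buf = 1.0, None
for _ in range(2):
    param, buf = sgd_step(param, grad=0.5, buf=buf, lr=0.1, momentum=0.9)
print(param, buf)  # approximately 0.855 and 0.95
```

With the Ludwig defaults (momentum 0.0, weight_decay 0.0), the function reduces to the plain update param - lr * grad.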

sgd_8bit

optimizer:
    type: sgd_8bit
    momentum: 0.0
    weight_decay: 0.0
    dampening: 0.0
    nesterov: false
    block_wise: false
    percentile_clipping: 100
  • momentum (default: 0.0):
  • weight_decay (default: 0.0):
  • dampening (default: 0.0):
  • nesterov (default: false):
  • block_wise (default: false): Whether to use block wise update.
  • percentile_clipping (default: 100): Percentile clipping.

lbfgs

optimizer:
    type: lbfgs
    max_iter: 20
    max_eval: null
    tolerance_grad: 1.0e-07
    tolerance_change: 1.0e-09
    history_size: 100
    line_search_fn: null
  • max_iter (default: 20): Maximum number of iterations per optimization step.
  • max_eval (default: null): Maximum number of function evaluations per optimization step. Default: max_iter * 1.25.
  • tolerance_grad (default: 1e-07): Termination tolerance on first order optimality.
  • tolerance_change (default: 1e-09): Termination tolerance on function value/parameter changes.
  • history_size (default: 100): Update history size.
  • line_search_fn (default: null): Line search function to use. Options: strong_wolfe, null.

adam

optimizer:
    type: adam
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • weight_decay (default: 0.0): Weight decay (L2 penalty).
  • amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'.
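
The sketch below shows how betas, eps, and weight_decay enter a single Adam update on one scalar parameter. It is a plain Python illustration of the standard bias-corrected Adam rule, assuming the adam optimizer wraps torch.optim.Adam unchanged. The function name adam_step is hypothetical, and the AMSGrad variant is left out for brevity.

```python
import math
from typing import Tuple


def adam_step(
    param: float,
    grad: float,
    m: float,
    v: float,
    step: int,
    lr: float = 0.001,
    betas: Tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-08,
    weight_decay: float = 0.0,
) -> Tuple[float, float, float]:
    """One Adam update on a scalar; step counts from 1. Returns (param, m, v)."""
    beta1, beta2 = betas
    grad = grad + weight_decay * param           # L2 penalty added to the gradient
    m = beta1 * m + (1.0 - beta1) * grad         # running average of the gradient
    v = beta2 * v + (1.0 - beta2) * grad * grad  # running average of the squared gradient
    m_hat = m / (1.0 - beta1**step)              # bias correction for the zero-initialised averages
    v_hat = v / (1.0 - beta2**step)
    # eps keeps the denominator away from zero.
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
print(param)  # approximately 0.999
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly lr regardless of the gradient's magnitude.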

adam_8bit

optimizer:
    type: adam_8bit
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    block_wise: true
    percentile_clipping: 100
  • betas (default: [0.9, 0.999]):
  • eps (default: 1e-08):
  • weight_decay (default: 0.0):
  • amsgrad (default: false):
  • block_wise (default: true): Whether to use block wise update.
  • percentile_clipping (default: 100): Percentile clipping.

paged_adam

optimizer:
    type: paged_adam
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    block_wise: true
    percentile_clipping: 100
  • betas (default: [0.9, 0.999]):
  • eps (default: 1e-08):
  • weight_decay (default: 0.0):
  • amsgrad (default: false):
  • block_wise (default: true):
  • percentile_clipping (default: 100):

paged_adam_8bit

optimizer:
    type: paged_adam_8bit
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    block_wise: true
    percentile_clipping: 100
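  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • weight_decay (default: 0.0): Weight decay (L2 penalty).
  • amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'.
  • block_wise (default: true): Whether to use block wise update.
  • percentile_clipping (default: 100): Percentile clipping.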

adamw

optimizer:
    type: adamw
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • weight_decay (default: 0.0): Weight decay ($L2$ penalty).
  • amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'.

adamw_8bit

optimizer:
    type: adamw_8bit
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    block_wise: true
    percentile_clipping: 100
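  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • weight_decay (default: 0.0): Weight decay ($L2$ penalty).
  • amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'.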
  • block_wise (default: true): Whether to use block wise update.
  • percentile_clipping (default: 100): Percentile clipping.

paged_adamw

optimizer:
    type: paged_adamw
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    block_wise: true
    percentile_clipping: 100
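  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • weight_decay (default: 0.0): Weight decay ($L2$ penalty).
  • amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'.
  • block_wise (default: true): Whether to use block wise update.
  • percentile_clipping (default: 100): Percentile clipping.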

paged_adamw_8bit

optimizer:
    type: paged_adamw_8bit
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    block_wise: true
    percentile_clipping: 100
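  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • weight_decay (default: 0.0): Weight decay ($L2$ penalty).
  • amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'.
  • block_wise (default: true): Whether to use block wise update.
  • percentile_clipping (default: 100): Percentile clipping.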

adadelta

optimizer:
    type: adadelta
    rho: 0.9
    eps: 1.0e-06
    weight_decay: 0.0
  • rho (default: 0.9): Coefficient used for computing a running average of squared gradients.
  • eps (default: 1e-06): Term added to the denominator to improve numerical stability.
  • weight_decay (default: 0.0): Weight decay ($L2$ penalty).

adagrad

optimizer:
    type: adagrad
    initial_accumulator_value: 0
    lr_decay: 0
    weight_decay: 0
    eps: 1.0e-10
  • initial_accumulator_value (default: 0):
  • lr_decay (default: 0): Learning rate decay.
  • weight_decay (default: 0): Weight decay ($L2$ penalty).
  • eps (default: 1e-10): Term added to the denominator to improve numerical stability.

adagrad_8bit

optimizer:
    type: adagrad_8bit
    initial_accumulator_value: 0
    lr_decay: 0
    weight_decay: 0
    eps: 1.0e-10
    block_wise: true
    percentile_clipping: 100
  • initial_accumulator_value (default: 0):
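  • lr_decay (default: 0): Learning rate decay.
  • weight_decay (default: 0): Weight decay ($L2$ penalty).
  • eps (default: 1e-10): Term added to the denominator to improve numerical stability.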
  • block_wise (default: true): Whether to use block wise update.
  • percentile_clipping (default: 100): Percentile clipping.

adamax

optimizer:
    type: adamax
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • weight_decay (default: 0.0): Weight decay ($L2$ penalty).

nadam

optimizer:
    type: nadam
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    momentum_decay: 0.004
  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • weight_decay (default: 0.0): Weight decay ($L2$ penalty).
  • momentum_decay (default: 0.004): Momentum decay.

rmsprop

optimizer:
    type: rmsprop
    momentum: 0.0
    alpha: 0.99
    eps: 1.0e-08
    centered: false
    weight_decay: 0.0
  • momentum (default: 0.0): Momentum factor.
  • alpha (default: 0.99): Smoothing constant.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • centered (default: false): If True, computes the centered RMSProp, and the gradient is normalized by an estimation of its variance.
  • weight_decay (default: 0.0): Weight decay ($L2$ penalty).

rmsprop_8bit

optimizer:
    type: rmsprop_8bit
    momentum: 0.0
    alpha: 0.99
    eps: 1.0e-08
    centered: false
    weight_decay: 0.0
    block_wise: true
    percentile_clipping: 100
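  • momentum (default: 0.0): Momentum factor.
  • alpha (default: 0.99): Smoothing constant.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • centered (default: false): If True, computes the centered RMSProp, and the gradient is normalized by an estimation of its variance.
  • weight_decay (default: 0.0): Weight decay ($L2$ penalty).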
  • block_wise (default: true): Whether to use block wise update.
  • percentile_clipping (default: 100): Percentile clipping.

lamb

optimizer:
    type: lamb
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    bias_correction: true
    adam_w_mode: true
    percentile_clipping: 100
    block_wise: false
    max_unorm: 1.0
  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • weight_decay (default: 0.0): Weight decay (L2 penalty).
  • amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'.
  • bias_correction (default: true):
  • adam_w_mode (default: true): Whether to use the AdamW mode of this algorithm from the paper 'Decoupled Weight Decay Regularization'.
  • percentile_clipping (default: 100): Percentile clipping.
  • block_wise (default: false): Whether to use block wise update.
  • max_unorm (default: 1.0):

lamb_8bit

optimizer:
    type: lamb_8bit
    bias_correction: true
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
    amsgrad: false
    adam_w_mode: true
    percentile_clipping: 100
    block_wise: false
    max_unorm: 1.0
  • bias_correction (default: true):
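  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • weight_decay (default: 0.0): Weight decay (L2 penalty).
  • amsgrad (default: false): Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam and Beyond'.
  • adam_w_mode (default: true): Whether to use the AdamW mode of this algorithm from the paper 'Decoupled Weight Decay Regularization'.
  • percentile_clipping (default: 100): Percentile clipping.
  • block_wise (default: false): Whether to use block wise update.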
  • max_unorm (default: 1.0):

lars

optimizer:
    type: lars
    momentum: 0.9
    dampening: 0.0
    weight_decay: 0.0
    nesterov: false
    percentile_clipping: 100
    max_unorm: 1.0
  • momentum (default: 0.9): Momentum factor.
  • dampening (default: 0.0): Dampening for momentum.
  • weight_decay (default: 0.0): Weight decay (L2 penalty).
  • nesterov (default: false): Enables Nesterov momentum.
  • percentile_clipping (default: 100): Percentile clipping.
  • max_unorm (default: 1.0):

lars_8bit

optimizer:
    type: lars_8bit
    momentum: 0.9
    dampening: 0.0
    weight_decay: 0.0
    nesterov: false
    percentile_clipping: 100
    max_unorm: 1.0
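  • momentum (default: 0.9): Momentum factor.
  • dampening (default: 0.0): Dampening for momentum.
  • weight_decay (default: 0.0): Weight decay (L2 penalty).
  • nesterov (default: false): Enables Nesterov momentum.
  • percentile_clipping (default: 100): Percentile clipping.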
  • max_unorm (default: 1.0):

lion

optimizer:
    type: lion
    betas:
    - 0.9
    - 0.999
    weight_decay: 0.0
    percentile_clipping: 100
    block_wise: true
  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • weight_decay (default: 0.0): Weight decay (L2 penalty).
  • percentile_clipping (default: 100): Percentile clipping.
  • block_wise (default: true): Whether to use block wise update.

lion_8bit

optimizer:
    type: lion_8bit
    betas:
    - 0.9
    - 0.999
    weight_decay: 0.0
    percentile_clipping: 100
    block_wise: true
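  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • weight_decay (default: 0.0): Weight decay (L2 penalty).
  • percentile_clipping (default: 100): Percentile clipping.
  • block_wise (default: true): Whether to use block wise update.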

paged_lion

optimizer:
    type: paged_lion
    betas:
    - 0.9
    - 0.999
    weight_decay: 0.0
    percentile_clipping: 100
    block_wise: true
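  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • weight_decay (default: 0.0): Weight decay (L2 penalty).
  • percentile_clipping (default: 100): Percentile clipping.
  • block_wise (default: true): Whether to use block wise update.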

paged_lion_8bit

optimizer:
    type: paged_lion_8bit
    betas:
    - 0.9
    - 0.999
    weight_decay: 0.0
    percentile_clipping: 100
    block_wise: true
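  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • weight_decay (default: 0.0): Weight decay (L2 penalty).
  • percentile_clipping (default: 100): Percentile clipping.
  • block_wise (default: true): Whether to use block wise update.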

radam

optimizer:
    type: radam
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0.0
  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • eps (default: 1e-08): Term added to the denominator to improve numerical stability.
  • weight_decay (default: 0.0): Weight decay (L2 penalty).

adafactor

optimizer:
    type: adafactor
    lr: null
    scale_parameter: true
    relative_step: true
    warmup_init: false
  • lr (default: null): Learning rate. Set to None (default) when relative_step=True so that Adafactor manages its own schedule. Must be provided when relative_step=False.
  • scale_parameter (default: true): If True, the learning rate is scaled by the root mean square of the parameters. Should be True when relative_step=True.
  • relative_step (default: true): If True, a time-dependent learning rate is computed instead of using the external lr. Do not combine with an external LR scheduler.
  • warmup_init (default: false): If True, the time-dependent learning rate is linearly increased at initialization. Only effective when relative_step=True.
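
As a sketch of the manual learning rate mode described above, the configuration below turns off relative_step and supplies lr explicitly. The value 0.001 and the choice to disable scale_parameter are illustrative assumptions, not recommendations from this page.

```yaml
trainer:
  optimizer:
    type: adafactor
    lr: 0.001               # illustrative value; must be provided because relative_step is false
    scale_parameter: false  # assumption: use lr as given rather than scaling by parameter RMS
    relative_step: false    # use the external lr instead of Adafactor's own schedule
    warmup_init: false      # only effective when relative_step is true
```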

schedule_free_adamw

optimizer:
    type: schedule_free_adamw
    betas:
    - 0.9
    - 0.999
    weight_decay: 0.0
    warmup_steps: 0
  • betas (default: [0.9, 0.999]): Coefficients used for computing running averages of gradient and its square.
  • weight_decay (default: 0.0): Weight decay (decoupled L2 penalty).
  • warmup_steps (default: 0): Number of linear warmup steps built into the optimizer. Replaces an external warmup scheduler -- do not combine with one.
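
A minimal sketch of using the built-in warmup instead of an external scheduler follows. The warmup_steps and weight_decay values are illustrative assumptions; the scheduler fields are simply left at the defaults shown in the overview so that no external warmup or decay is active.

```yaml
trainer:
  optimizer:
    type: schedule_free_adamw
    weight_decay: 0.01       # illustrative value
    warmup_steps: 1000       # illustrative value; warmup is handled inside the optimizer
  learning_rate_scheduler:
    decay: null              # no external decay schedule
    warmup_fraction: 0.0     # no external warmup
    warmup_evaluations: 0
```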

muon

optimizer:
    type: muon
    momentum: 0.95
    nesterov: true
  • momentum (default: 0.95): Momentum factor for Nesterov SGD applied before orthogonalization.
  • nesterov (default: true): If True, use Nesterov momentum (look-ahead gradient) before orthogonalization. The original Muon paper uses Nesterov.

Note

Gradient clipping is also configurable, through optimizers, with the following parameters:

clip_global_norm: 0.5
clipnorm: null
clip_value: null

Optimizer guidance

Ludwig 0.15 includes five optimizers on top of the existing PyTorch family. Quick picks:

  • radam: Rectified Adam (Liu et al., ICLR 2020). Drop-in replacement for adam that removes the need for manual warmup by rectifying the variance of the adaptive learning rate in the early steps.
  • adafactor: Adafactor (Shazeer & Stern, ICML 2018). Factorizes the second-moment matrix to cut optimizer memory roughly in half, which makes it a common choice for fine-tuning large transformers. When relative_step: true (the default), Adafactor manages its own schedule: do not combine it with a learning_rate_scheduler, and leave learning_rate unset on the trainer.
  • schedule_free_adamw: Schedule-Free AdamW (Defazio et al., 2024). Matches cosine-decay AdamW without needing an LR scheduler at all. The optimizer maintains two iterate states; Ludwig handles the required optimizer.train() / optimizer.eval() calls automatically at the train/eval boundaries.
  • muon: Muon (Jordan et al., 2024). Uses momentum plus Newton–Schulz orthogonalization to produce stable, well-conditioned updates. Competitive with AdamW for pretraining and typically uses a much higher base learning rate than Adam (default 0.02).
  • soap: SOAP (Vyas et al., 2024). Shampoo-style preconditioner stacked on AdamW. Strong empirical results on large-scale training; requires installing the optional soap-pytorch package. Registered only when the dependency is available.
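
To illustrate the higher base learning rate mentioned for muon, here is a hedged sketch. It assumes the base rate is set through the trainer's learning_rate field, as in the overview config; the optimizer values are the documented defaults.

```yaml
trainer:
  learning_rate: 0.02   # the Muon default quoted above; assumed to be set via trainer.learning_rate
  optimizer:
    type: muon
    momentum: 0.95      # documented default
    nesterov: true      # documented default
```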

Note

The legacy ftrl optimizer was removed in 0.14. Configs that set optimizer.type: ftrl will fail validation; use adagrad or sgd with momentum as replacements.
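
For a config that previously used ftrl, a migration to adagrad might look like the sketch below. The parameter values are the adagrad defaults listed earlier on this page; whether they suit a given model is not something this page establishes.

```yaml
trainer:
  optimizer:
    type: adagrad                  # previously: type: ftrl
    initial_accumulator_value: 0
    lr_decay: 0
    weight_decay: 0
    eps: 1.0e-10
```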

Note

New in 0.15: muon and schedule_free_adamw optimizers are now stable and recommended for large-scale pretraining and fine-tuning respectively.

Learning rate schedulers

The learning_rate_scheduler section of the trainer controls how the learning rate evolves during training. Ludwig 0.15 supports four additional schedule types on top of linear, exponential, and cosine:

decay Best for Key parameters
one_cycle Fast supervised training, "superconvergence" max_lr, pct_start, div_factor, final_div_factor
inverse_sqrt Transformer pretraining ("Noam" schedule) inverse_sqrt_warmup_steps
polynomial Fine-tuning with a smooth ramp-down polynomial_power, polynomial_end_lr
wsd Long continued pretraining with annealing wsd_warmup_fraction, wsd_stable_fraction, wsd_decay_fraction

OneCycleLR

Implements Smith's 1-cycle policy: warm up from initial_lr = max_lr / div_factor to max_lr over pct_start of the total steps, then anneal down to min_lr = initial_lr / final_div_factor.

trainer:
  optimizer:
    type: adamw
  learning_rate_scheduler:
    decay: one_cycle
    max_lr: 0.001
    pct_start: 0.3
    div_factor: 25.0
    final_div_factor: 10000.0

If max_lr is left unset it defaults to the trainer's learning_rate.
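
As a worked example, plugging the values from the config above into the two formulas gives the starting and final learning rates:

```latex
\begin{aligned}
\text{initial\_lr} &= \frac{\text{max\_lr}}{\text{div\_factor}} = \frac{10^{-3}}{25} = 4 \times 10^{-5} \\
\text{min\_lr} &= \frac{\text{initial\_lr}}{\text{final\_div\_factor}} = \frac{4 \times 10^{-5}}{10^{4}} = 4 \times 10^{-9}
\end{aligned}
```

With pct_start: 0.3, the peak of 0.001 is reached 30% of the way through the run, and the remaining 70% is spent annealing.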

Inverse square root (Noam)

After a linear warmup over inverse_sqrt_warmup_steps, the learning rate decays as 1 / sqrt(step). This is the schedule used in the original Transformer paper and is a good default for training language models from scratch.

trainer:
  learning_rate: 0.0005
  learning_rate_scheduler:
    decay: inverse_sqrt
    inverse_sqrt_warmup_steps: 4000
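
One common way to write this schedule is shown below. This is an assumed formulation for illustration; Ludwig's implementation may differ in constant factors. Here w stands for inverse_sqrt_warmup_steps and t for the current step.

```latex
\text{lr}(t) = \text{lr}_{\text{base}} \cdot \min\!\left(\frac{t}{w},\ \sqrt{\frac{w}{t}}\right)
```

Under this form, with the config above (base rate 0.0005, w = 4000), the rate is 0.000125 at step 1000, peaks at 0.0005 at step 4000, and has halved to 0.00025 by step 16000.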

Polynomial decay

Smoothly decays from the base learning rate to polynomial_end_lr over the full training run. polynomial_power: 1.0 is linear decay; 2.0 is quadratic and drops faster early on; values < 1.0 hold the learning rate higher for longer and drop more sharply near the end.

trainer:
  learning_rate_scheduler:
    decay: polynomial
    polynomial_power: 1.0
    polynomial_end_lr: 0.0
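
The usual polynomial decay formula is sketched below, on the assumption that Ludwig follows the standard form. T is the total number of training steps, p is polynomial_power, and the two learning rates are the base rate and polynomial_end_lr.

```latex
\text{lr}(t) = \left(\text{lr}_{0} - \text{lr}_{\text{end}}\right)\left(1 - \frac{t}{T}\right)^{p} + \text{lr}_{\text{end}}
```

Halfway through the run (t/T = 0.5) with an end rate of 0, this gives about 0.71 of the base rate for p = 0.5, 0.5 for p = 1.0, and 0.25 for p = 2.0.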

Warmup-Stable-Decay (WSD)

WSD splits the run into three phases: a short warmup, a long stable phase at the peak learning rate, and a short cooldown. Because the stable phase dominates, you can stop at almost any point and take a clean cooldown snapshot, which is useful for long pretraining runs where you want to branch continued training off of intermediate checkpoints.

trainer:
  learning_rate_scheduler:
    decay: wsd
    wsd_warmup_fraction: 0.1
    wsd_stable_fraction: 0.8
    wsd_decay_fraction: 0.1

The three fractions should sum to 1.0.
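
To make the phase lengths concrete, the sketch below pairs the default fractions with an explicit step budget. It assumes the fractions are applied to the total number of training steps; the train_steps value is illustrative.

```yaml
trainer:
  train_steps: 10000             # illustrative total
  learning_rate_scheduler:
    decay: wsd
    wsd_warmup_fraction: 0.1     # 0.1 * 10000 = 1000 warmup steps
    wsd_stable_fraction: 0.8     # 0.8 * 10000 = 8000 steps at the peak learning rate
    wsd_decay_fraction: 0.1      # 0.1 * 10000 = 1000 cooldown steps
```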

Warmup and plateau parameters

All schedules support the shared warmup and reduce-on-plateau knobs:

  • warmup_fraction / warmup_evaluations: linear warmup of the base learning rate.
  • reduce_on_plateau, reduce_on_plateau_patience, reduce_on_plateau_rate: set the maximum number of on-plateau reductions, how long to wait without improvement before reducing, and the reduction factor. Combine with reduce_eval_metric and reduce_eval_split to pick which metric triggers the reduction.
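
A sketch combining these knobs with one of the decay schedules follows. All numeric values are illustrative, and validation is assumed to be an accepted value for reduce_eval_split (the overview only shows the default, training).

```yaml
trainer:
  learning_rate_scheduler:
    decay: polynomial
    warmup_fraction: 0.05            # illustrative: linear warmup over the first 5% of training
    reduce_on_plateau: 3             # at most 3 reductions
    reduce_on_plateau_patience: 10   # documented default
    reduce_on_plateau_rate: 0.5      # illustrative reduction factor
    reduce_eval_metric: loss
    reduce_eval_split: validation    # assumption: track the validation split instead of training
```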

Training length

The length of the training process is configured by:

  • epochs (default: 100): One epoch is one pass through the entire dataset. With the default, training runs for at most 100 epochs before terminating.
  • train_steps (default: null): The maximum number of steps to train for, using one mini-batch per step. By default this is unset, and epochs determines the training length.

Tip

In general, it's a good idea to set up a long training runway, relying on early stopping criteria (early_stop) to stop training when there hasn't been any improvement for a long time.
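
A long runway with early stopping could be configured as in this sketch; both numbers are illustrative rather than recommended values.

```yaml
trainer:
  epochs: 500      # generous upper bound on training length
  early_stop: 10   # stop after 10 consecutive evaluations without improvement
```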

Early stopping

Machine learning models trained for too long are prone to overfitting. It's generally good policy to set up early stopping criteria: there is little value in continuing to train a model after it has learned what it can, and stopping at that point helps it retain its ability to generalize to new data.

How early stopping works in Ludwig

By default, Ludwig sets trainer.early_stop=5, which means that if there have been 5 consecutive rounds of evaluation where there hasn't been any improvement on the validation subset, then training will terminate.

Ludwig runs evaluation once per checkpoint, which by default is once per epoch. Checkpoint frequency can be configured using checkpoints_per_epoch (default: 0) or steps_per_checkpoint (default: 0, disabled). See the Checkpoint-evaluation frequency section below for more details.

Changing the early stopping metric

The metric that drives early stopping is set by trainer.validation_field and trainer.validation_metric. By default, early stopping uses the combined loss on the validation subset.

trainer:
    validation_field: combined
    validation_metric: loss

However, this can be configured to use other metrics. For example, if we had an output feature called recommended, then we can configure early stopping on the output feature accuracy like so:

trainer:
    validation_field: recommended
    validation_metric: accuracy

Disabling early stopping

trainer.early_stop can be set to -1, which disables early stopping entirely.
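
In config form:

```yaml
trainer:
    early_stop: -1   # disables early stopping; training runs for the full epochs / train_steps budget
```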

Checkpoint-evaluation frequency

Evaluation is run every time the model is checkpointed.

By default, checkpoint-evaluation will occur once every epoch.

The frequency of checkpoint-evaluation can be configured using:

  • steps_per_checkpoint (default: 0): every n training steps
  • checkpoints_per_epoch (default: 0): n times per epoch

Note

It is invalid to specify both non-zero steps_per_checkpoint and non-zero checkpoints_per_epoch.
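
As an illustration of a step-based schedule, the sketch below checkpoints and evaluates every 2000 steps. The step count is an arbitrary example value.

```yaml
trainer:
    steps_per_checkpoint: 2000   # evaluate every 2000 training steps
    checkpoints_per_epoch: 0     # left at 0, since the two options cannot both be non-zero
    early_stop: 5                # 5 evaluations without improvement = 10000 steps
```

Because early stopping counts evaluation rounds, changing the checkpoint frequency also changes how many training steps early_stop tolerates without improvement.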

Tip

Running evaluation once per epoch is an appropriate fit for small datasets that fit in memory and train quickly. However, this can be a poor fit for unstructured datasets, which tend to be much larger, and train more slowly due to larger models.

Running evaluation too frequently can be wasteful while running evaluation not frequently enough can be uninformative. In large-scale training runs, it's common for evaluation to be configured to run on a sub-epoch time scale, or every few thousand steps.

We recommend configuring evaluation such that new evaluation results are available at least several times an hour. In general, it is not necessary for models to train over the entirety of a dataset, nor evaluate over the entirety of a test set, to produce useful monitoring metrics and signals to indicate model performance.

Increasing throughput on GPUs

Increase batch size

trainer:
    batch_size: auto

Users training on GPUs can often increase training throughput by increasing the batch_size so that more examples are computed every training step. Set batch_size to auto to use the largest batch size that can fit in memory.
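
If automatic tuning should not grow the batch size beyond a chosen bound, it can be paired with max_batch_size, as in this sketch. It assumes max_batch_size acts as an upper cap on the tuned value; the value 512 is illustrative.

```yaml
trainer:
    batch_size: auto
    max_batch_size: 512   # assumption: caps the batch size chosen by auto tuning
```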

Use mixed precision

trainer:
    use_mixed_precision: true

Speeds up training by using float16 parameters where it makes sense. Mixed precision training on GPU can dramatically speed up training, with some risk to model convergence. In practice, it works particularly well when fine-tuning a pretrained model like a HuggingFace transformer. See blog here for more details.

Multi-Task Loss Balancing

When training models with multiple output features, the default behavior is to sum each feature's loss with a static weight. The loss_balancing parameter enables adaptive strategies that automatically balance task losses during training.

trainer:
    loss_balancing: uncertainty  # none, log_transform, uncertainty, famo, gradnorm

Available strategies:

  • none (default): Static weighted sum.
  • log_transform: Applies log(1 + loss) to compress loss scales before weighting. Simple and always beneficial when task losses have very different magnitudes.
  • uncertainty: Learns a log-variance parameter per task (Kendall et al., CVPR 2018). No hyperparameters needed.
  • famo: Fast Adaptive Multitask Optimization (Liu et al., NeurIPS 2023). O(1) overhead, competitive with gradient-based methods.
  • gradnorm: Gradient normalization across tasks (Chen et al., ICML 2018). Dynamically adjusts weights to normalize gradient magnitudes.
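
To see how log_transform compresses loss scales, consider two hypothetical task losses of 0.5 and 500, assuming a natural logarithm:

```latex
\begin{aligned}
\log(1 + 0.5) &\approx 0.405 \\
\log(1 + 500) &\approx 6.217
\end{aligned}
```

The raw losses differ by a factor of 1000, while the transformed values differ by a factor of about 15, so the larger task no longer dominates the sum to the same degree.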

Modality Dropout

During training, randomly replaces input feature encoder outputs with learnable "missing modality" embeddings. This improves robustness when some inputs may be missing at inference time.

trainer:
    modality_dropout: 0.1  # probability per feature, 0.0 to disable

Model Soup

Averages the weights of the top-K checkpoints saved during training for better generalization at zero inference cost (Wortsman et al., ICML 2022).

trainer:
    model_soup: uniform  # uniform, greedy, or null to disable
    model_soup_top_k: 5
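
For the uniform setting, the averaging can be written as below, where K is model_soup_top_k and each term is the weight vector of one of the top-K checkpoints. This is the standard uniform soup formula from the cited paper; Ludwig's checkpoint selection details are not specified on this page.

```latex
\theta_{\text{soup}} = \frac{1}{K} \sum_{k=1}^{K} \theta_{k}
```

In the paper, the greedy variant instead adds checkpoints one at a time and keeps each only if the averaged model's validation performance does not get worse.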

Quality Presets

Quality presets auto-configure the combiner, trainer, and other settings for different quality/speed tradeoffs. User-specified config values always take precedence over preset defaults.

preset: best_quality  # medium_quality, high_quality, or best_quality
input_features:
  - name: feature1
    type: number
output_features:
  - name: target
    type: category

Available presets:

  • medium_quality: Concat combiner, 50 epochs, batch_size 256. Fast training.
  • high_quality: Transformer combiner, uncertainty loss balancing, 100 epochs.
  • best_quality: FT-Transformer combiner, uncertainty loss balancing, model soup, 200 epochs.
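
As a sketch of the precedence rule, the config below selects high_quality but sets epochs explicitly, so the explicit value is used in place of the preset's 100 epochs. The feature names are the placeholders from the example above.

```yaml
preset: high_quality
input_features:
  - name: feature1
    type: number
output_features:
  - name: target
    type: category
trainer:
  epochs: 20   # user-specified value; takes precedence over the preset default
```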

Preference-Based LLM Training

Ludwig supports several preference optimization trainers for LLMs that align model outputs with human preferences. These trainers are available when using model_type: llm.

DPO Trainer

Direct Preference Optimization (Rafailov et al., NeurIPS 2023) trains a model to prefer chosen completions over rejected ones without a separate reward model. DPO reformulates the RLHF objective as a simple classification loss on preference pairs.

Requires data with a prompt column, the output column (containing chosen completions), and a rejected column (containing rejected completions).

model_type: llm
trainer:
    type: dpo
    dpo_beta: 0.1
    dpo_loss_type: sigmoid  # sigmoid or ipo
    dpo_label_smoothing: 0.0
    rejected_column: rejected

Parameters:

  • dpo_beta (default 0.1): Temperature parameter controlling how much the policy can deviate from the reference model. Higher values keep the policy closer to the reference; lower values allow it to deviate more. Typical range: 0.05 to 0.5.
  • dpo_loss_type (default sigmoid): DPO loss variant. sigmoid is the standard DPO loss. ipo is Identity Preference Optimization which uses a squared loss.
  • dpo_label_smoothing (default 0.0): Label smoothing for DPO preference targets. 0 means no smoothing.
  • rejected_column (default rejected): Name of the column containing rejected completions.
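
For reference, the standard sigmoid DPO loss from the cited paper is given below for a single preference pair. The symbols are the paper's, not Ludwig identifiers: x is the prompt, y_w the chosen completion, y_l the rejected completion, the two policies are the model being trained and the frozen reference model, and β corresponds to dpo_beta.

```latex
\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left( \beta \left[ \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right] \right)
```

How Ludwig applies dpo_label_smoothing to this loss is not specified on this page.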

KTO Trainer

Kahneman-Tversky Optimization (Ethayarajh et al., 2024) is a preference optimization method that works with binary feedback (good/bad) rather than requiring paired preferences. This makes it practical when paired preference data is unavailable.

model_type: llm
trainer:
    type: kto
    kto_beta: 0.1
    rejected_column: rejected

ORPO Trainer

Odds Ratio Preference Optimization (Hong et al., 2024) combines supervised fine-tuning with preference optimization in a single training step, eliminating the need for a reference model.

model_type: llm
trainer:
    type: orpo
    orpo_beta: 0.1
    rejected_column: rejected

GRPO Trainer

Group Relative Policy Optimization (Shao et al., 2024) generates multiple completions per prompt and uses group-relative rewards to optimize the policy. This is the method used to train DeepSeek-R1.

model_type: llm
trainer:
    type: grpo
    grpo_beta: 0.04
    grpo_epsilon: 0.2
    grpo_num_generations: 4

Parameters:

  • grpo_beta (default 0.04): KL penalty coefficient.
  • grpo_epsilon (default 0.2): PPO clipping parameter.
  • grpo_num_generations (default 4): Number of completions to generate per prompt.
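
The group-relative reward in the cited paper can be sketched as follows: for a prompt with G sampled completions (grpo_num_generations) and rewards r_1 through r_G, each completion's advantage is its reward normalized within the group. This is the paper's formulation; Ludwig's exact implementation may differ.

```latex
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
```

These advantages then feed a PPO-style clipped objective, with grpo_epsilon as the clipping range and grpo_beta weighting the KL penalty against the reference model.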