Optimizer Comparison

Choosing the right optimizer can meaningfully affect convergence speed and final accuracy. Ludwig exposes all major optimizers through the trainer.optimizer config block.

Supported optimizers

Optimizer   Key strength                          Typical use case
sgd         Baseline, well-understood             Small tabular models
adam        Fast convergence, adaptive LR         Default for most tasks
adamw       Adam + decoupled weight decay         Fine-tuning pretrained models
rmsprop     Stable for RNNs                       Sequence models
lion        Memory-efficient (no second moment)   Large models on limited memory
sophia      Second-order curvature estimate       Transformers, NLP tasks

Configurations

Adam (default)

trainer:
  optimizer:
    type: adam
    lr: 1.0e-3
    betas:
      - 0.9
      - 0.999
    eps: 1.0e-8
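
For intuition, here is a minimal NumPy sketch of the update these parameters control (illustrative pseudocode, not Ludwig's internals): betas set the decay rates of the two running moment estimates, and eps guards the division.

import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running first/second moment buffers."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: running mean of grads
    v = beta2 * v + (1 - beta2) * grad**2    # second moment: running mean of squared grads
    m_hat = m / (1 - beta1**t)               # bias correction for early steps (t starts at 1)
    v_hat = v / (1 - beta2**t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v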

AdamW

AdamW decouples weight decay from the gradient update: decay is applied directly to the weights instead of being folded into the gradient, so Adam's adaptive denominator never rescales it into an inconsistent, per-parameter regularizer. Use it when training from scratch with weight-decay regularization or when fine-tuning pretrained models.

trainer:
  optimizer:
    type: adamw
    lr: 3.0e-4
    weight_decay: 0.01
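
To make the decoupling concrete, here is a minimal NumPy sketch (illustrative, not Ludwig's implementation). Plain Adam with L2 regularization would add weight_decay * param to the gradient, where the adaptive denominator then rescales it; AdamW applies the decay term to the weights directly.

import numpy as np

def adamw_step(param, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update with decoupled weight decay."""
    # Coupled L2 would instead do: grad = grad + weight_decay * param
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Decay acts on the weights directly, untouched by the adaptive scaling.
    return param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param), m, v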

Lion

Lion (EvoLved Sign Momentum) takes steps using only the sign of an interpolation between the gradient and its momentum, so it keeps no second-moment accumulator. This roughly halves optimizer state memory vs. Adam/AdamW, a major advantage when training large models. Lion typically works best with a learning rate 3–10× smaller than AdamW's.

trainer:
  optimizer:
    type: lion
    lr: 3.0e-5
    weight_decay: 0.1
    betas:
      - 0.9
      - 0.99
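
The whole update fits in a few lines; here is a minimal NumPy sketch (illustrative pseudocode, not Ludwig's implementation). Note the single momentum buffer, and that before weight decay every coordinate moves by exactly ±lr.

import numpy as np

def lion_step(param, grad, m, lr=3e-5, beta1=0.9, beta2=0.99, weight_decay=0.1):
    """One Lion update; m is the only optimizer state."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)      # sign of interpolated momentum
    param = param - lr * (update + weight_decay * param)  # decoupled decay, as in AdamW
    m = beta2 * m + (1 - beta2) * grad                    # momentum refreshed with beta2
    return param, m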

Sophia

Sophia estimates the diagonal of the loss Hessian with a Hutchinson estimator and uses it to precondition the gradient step, clipping each coordinate so that updates stay bounded where the curvature estimate is small or stale. Its authors report up to a 2× speedup over AdamW when pretraining transformer language models.

trainer:
  optimizer:
    type: sophia
    lr: 2.0e-4
    betas:
      - 0.965
      - 0.99
    rho: 0.04
    weight_decay: 0.1
    update_period: 10     # how often to re-estimate the Hessian
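
A rough NumPy sketch of one Sophia step (illustrative pseudocode, not Ludwig's implementation; hvp, a Hessian-vector product, is a hypothetical helper supplied by an autodiff framework, since NumPy alone cannot compute it). rho scales the curvature estimate, and the elementwise clip to [-1, 1] keeps coordinates with tiny curvature estimates from taking huge steps.

import numpy as np

def sophia_step(param, grad, m, h, t, hvp, lr=2e-4, beta1=0.965, beta2=0.99,
                rho=0.04, weight_decay=0.1, eps=1e-12, update_period=10):
    """One Sophia step; h is the running diagonal-Hessian estimate."""
    m = beta1 * m + (1 - beta1) * grad
    if t % update_period == 0:
        u = np.random.randn(*param.shape)          # Gaussian probe for Hutchinson's estimator
        h = beta2 * h + (1 - beta2) * u * hvp(u)   # E[u * (H @ u)] = diag(H)
    # Precondition by curvature, then clip each coordinate to [-1, 1].
    step = np.clip(m / np.maximum(rho * h, eps), -1.0, 1.0)
    return param - lr * (step + weight_decay * param), m, h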

Learning rate schedulers

Any optimizer can be paired with a learning rate scheduler. The most commonly used options are:

trainer:
  learning_rate_scheduler:
    type: cosine          # cosine annealing — good default for fine-tuning
    # type: reduce_on_plateau  # reduce LR when val loss stagnates
    # type: linear_warmup   # warmup then constant — good for transformers
    warmup_fraction: 0.1  # fraction of training steps used for warm-up
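
As a pure function of the step counter, cosine annealing with linear warmup looks like this (a minimal Python sketch of the schedule's shape, not Ludwig's scheduler code):

import math

def lr_at(step, total_steps, base_lr=3e-4, warmup_fraction=0.1, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay toward min_lr."""
    warmup_steps = int(warmup_fraction * total_steps)
    if step < warmup_steps:
        return base_lr * step / max(warmup_steps, 1)   # linear ramp from 0
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))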

Use Ludwig's hyperopt to find the best optimizer and learning rate automatically:

hyperopt:
  search_alg: hyperband
  goal: minimize
  metric: validation_loss
  parameters:
    trainer.optimizer.type:
      type: category
      values: [adam, adamw, lion]
    trainer.optimizer.lr:
      type: float
      low: 1.0e-5
      high: 1.0e-2
      scale: log
    trainer.optimizer.weight_decay:
      type: float
      low: 0.0
      high: 0.1
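
The log scale on the learning-rate search matters: sampled uniformly, almost every draw from [1e-5, 1e-2] would land in the top decade. A minimal Python illustration of log-uniform sampling (independent of Ludwig's actual sampler):

import math
import random

def sample_loguniform(low=1.0e-5, high=1.0e-2):
    """Sample uniformly in log10 space, so each decade is equally likely."""
    return 10 ** random.uniform(math.log10(low), math.log10(high))

# A plain uniform sampler over [1e-5, 1e-2] would put ~90% of its draws
# above 1e-3, barely exploring the small-learning-rate region.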

Practical guidance

  • Start with AdamW for fine-tuning tasks and Adam for training from scratch.
  • Switch to Lion if GPU memory is tight and you cannot reduce batch size further.
  • Try Sophia for transformer encoders in NLP tasks when training time matters.
  • Learning rate is the most important hyperparameter — always tune it first regardless of optimizer.
  • Warm-up (5–10% of total steps) is important for all adaptive optimizers when training transformers.