Optimizer Comparison¶
Choosing the right optimizer can meaningfully affect convergence speed and final accuracy.
Ludwig exposes each of the optimizers below through the `trainer.optimizer` config block.
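The same settings can also be supplied programmatically. Below is a minimal sketch assuming Ludwig's Python API (`LudwigModel` from `ludwig.api`); the feature names and dataset path are placeholders, and the optimizer block mirrors the YAML examples later on this page.

```python
from ludwig.api import LudwigModel

# Placeholder feature names and dataset path: substitute your own.
config = {
    "input_features": [{"name": "text", "type": "text"}],
    "output_features": [{"name": "label", "type": "category"}],
    "trainer": {
        "optimizer": {"type": "adamw", "lr": 3.0e-4, "weight_decay": 0.01},
    },
}

model = LudwigModel(config)
results = model.train(dataset="train.csv")  # train with the configured optimizer
```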
Supported optimizers¶
| Optimizer | Key strength | Typical use case |
|---|---|---|
| `sgd` | Baseline, well-understood | Small tabular models |
| `adam` | Fast convergence, adaptive LR | Default for most tasks |
| `adamw` | Adam + decoupled weight decay | Fine-tuning pretrained models |
| `rmsprop` | Stable for RNNs | Sequence models |
| `lion` | Memory-efficient (no second moment) | Large models on limited memory |
| `sophia` | Second-order curvature estimate | Transformers, NLP tasks |
Configurations¶
Adam (default)¶
```yaml
trainer:
  optimizer:
    type: adam
    lr: 1.0e-3
    betas:
      - 0.9
      - 0.999
    eps: 1.0e-8
```
AdamW¶
AdamW decouples weight decay from the gradient update: in plain Adam, an L2 penalty is added to the gradient and then rescaled by the adaptive moment estimates, so the effective decay varies per parameter. Use AdamW when fine-tuning pretrained models or when training from scratch with weight regularization.
```yaml
trainer:
  optimizer:
    type: adamw
    lr: 3.0e-4
    weight_decay: 0.01
```
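To make the decoupling concrete, here is a single-tensor sketch of one AdamW step in PyTorch. It is not Ludwig's internal implementation; it only shows that the decay term acts on the parameter directly instead of passing through the moment estimates.

```python
import torch

def adamw_step(param, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update (sketch, not Ludwig internals); t starts at 1."""
    # Decoupled weight decay: applied to the parameter directly.
    # Plain Adam with L2 would instead do grad = grad + weight_decay * param,
    # letting the decay flow through the adaptive moment estimates.
    param.mul_(1 - lr * weight_decay)
    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first moment
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment
    m_hat = m / (1 - beta1 ** t)                         # bias correction
    v_hat = v / (1 - beta2 ** t)
    param.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)
```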
Lion¶
Lion (EvoLved Sign Momentum) applies only the sign of a momentum-interpolated gradient as its update, so it needs no second-moment accumulator. This roughly halves optimizer state memory compared with Adam/AdamW, a major advantage when training large models. Because sign updates have constant magnitude, Lion typically works best with a learning rate 3–10× smaller than AdamW's.
```yaml
trainer:
  optimizer:
    type: lion
    lr: 3.0e-5
    weight_decay: 0.1
    betas:
      - 0.9
      - 0.99
```
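A minimal PyTorch sketch of one Lion step (again, not Ludwig's internal code) makes the memory argument visible: the only state is a single momentum buffer, and the applied update is just a sign.

```python
import torch

def lion_step(param, grad, exp_avg, lr=3e-5, beta1=0.9, beta2=0.99,
              weight_decay=0.1):
    """One Lion update (sketch). Only one momentum buffer (exp_avg) is
    kept; there is no second-moment accumulator as in Adam/AdamW."""
    param.mul_(1 - lr * weight_decay)                   # decoupled decay
    # Interpolate momentum and gradient, then keep only the sign.
    update = (exp_avg * beta1 + grad * (1 - beta1)).sign_()
    param.add_(update, alpha=-lr)
    # The momentum buffer is refreshed with beta2 after the step.
    exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)
```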
Sophia¶
Sophia estimates the diagonal of the loss Hessian with a Hutchinson estimator and uses it to precondition the gradient step. On NLP benchmarks it has been reported to reach the same loss as AdamW in roughly half the steps when pretraining transformers.
```yaml
trainer:
  optimizer:
    type: sophia
    lr: 2.0e-4
    betas:
      - 0.965
      - 0.99
    rho: 0.04
    weight_decay: 0.1
    update_period: 10  # re-estimate the diagonal Hessian every 10 steps
```
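The Hutchinson trick itself is easy to sketch in PyTorch: draw a random ±1 vector z and average z ⊙ (Hz), where the Hessian-vector product comes from differentiating (∇L · z). This illustrates the estimator only; it is not Ludwig's implementation.

```python
import torch

def hutchinson_diag_hessian(loss, params, n_samples=1):
    """Estimate diag(H) as E[z * (H @ z)] with z ~ Rademacher (+/-1)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimates = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [torch.randint_like(p, 0, 2) * 2 - 1 for p in params]  # +/-1
        # Hessian-vector product: differentiate (grad . z) w.r.t. params.
        grad_dot_z = sum((g * z).sum() for g, z in zip(grads, zs))
        hvps = torch.autograd.grad(grad_dot_z, params, retain_graph=True)
        for est, z, hvp in zip(estimates, zs, hvps):
            est += z * hvp / n_samples
    return estimates
```

Sophia then divides its momentum by a clipped version of this estimate (controlled by `rho`), and refreshes the estimate only every `update_period` steps to keep overhead low.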
Learning rate schedulers¶
Every optimizer can be paired with a learning rate scheduler. The most commonly used options are shown below:
```yaml
trainer:
  learning_rate_scheduler:
    type: cosine                # cosine annealing: good default for fine-tuning
    # type: reduce_on_plateau   # reduce LR when validation loss stagnates
    # type: linear_warmup       # warm-up then constant: good for transformers
    warmup_fraction: 0.1        # fraction of training steps used for warm-up
```
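As a rough mental model (plain Python, not Ludwig's scheduler code), linear warm-up followed by cosine annealing computes the learning rate per step like this:

```python
import math

def lr_at_step(step, total_steps, base_lr=3.0e-4, warmup_fraction=0.1):
    """Linear warm-up followed by cosine annealing to zero (sketch)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return base_lr * step / warmup_steps              # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))  # cosine decay
```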
Hyperopt search¶
Use Ludwig's hyperopt to find the best optimizer and learning rate automatically:
```yaml
hyperopt:
  search_alg: hyperband
  goal: minimize
  metric: validation_loss
  parameters:
    trainer.optimizer.type:
      type: category
      values: [adam, adamw, lion]
    trainer.optimizer.lr:
      type: float
      low: 1.0e-5
      high: 1.0e-2
      scale: log
    trainer.optimizer.weight_decay:
      type: float
      low: 0.0
      high: 0.1
```
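The `scale: log` entry matters more than it looks. A quick illustration in plain Python (the helper name is hypothetical) of why the learning rate should be sampled log-uniformly:

```python
import math
import random

# Uniform sampling in [1e-5, 1e-2] would put ~99% of trials above 1e-4;
# log-uniform sampling spends equal effort on each decade.
def sample_lr_log_uniform(low=1.0e-5, high=1.0e-2):
    return math.exp(random.uniform(math.log(low), math.log(high)))
```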
Practical guidance¶
- Start with AdamW for fine-tuning tasks and Adam for training from scratch.
- Switch to Lion if GPU memory is tight and you cannot reduce batch size further.
- Try Sophia for transformer encoders in NLP tasks when training time matters.
- Learning rate is the most important hyperparameter — always tune it first regardless of optimizer.
- Warm-up (5–10% of total steps) is important for all adaptive optimizers when training transformers.