# LLM Alignment with Preference Learning

Alignment training adapts a language model so its outputs match human values and preferences.
The classic RLHF pipeline — collect human rankings, train a reward model, run PPO — is
expensive and notoriously unstable. Ludwig provides a family of modern preference learning
trainers that skip the reward model entirely and run with the same `ludwig train` command used
for standard fine-tuning.
This page covers:
- When to use each alignment method
- How to prepare datasets for each trainer
- CLI and Python API examples
- Uploading the aligned model to HuggingFace Hub
The full runnable notebook is in `examples/alignment/alignment_dpo.ipynb`.
## Alignment trainers

Ludwig supports four preference learning trainers, all selected via `trainer.type`:
| Trainer | Paper | Data format | Key hyperparameter |
|---|---|---|---|
| `dpo` | Rafailov et al., 2023 | `prompt`, `chosen`, `rejected` | `beta` — KL penalty (default 0.1) |
| `kto` | Ethayarajh et al., 2024 | `prompt`, `response`, `label` (bool) | `desirable_weight`, `undesirable_weight` |
| `orpo` | Hong et al., 2024 | `prompt`, `chosen`, `rejected` | — (no reference model) |
| `grpo` | Shao et al., 2024 | `prompt` + reward function | `num_generations` — rollouts per prompt |
### DPO

Direct Preference Optimization reformulates the RLHF objective so that the policy is the only
model being trained — no separate reward model is required. The loss raises the log-probability
of the chosen response relative to the rejected response, subject to a KL-divergence penalty
controlled by `beta`.
DPO is the most widely studied alignment method and a natural first choice when you have paired preference data.
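The shape of the objective is easy to see in isolation. The sketch below is not Ludwig's implementation — it is a minimal per-pair DPO loss in plain Python, assuming the summed log-probabilities of each response under the policy and the frozen reference model are already available:

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))
```

When the policy's chosen-over-rejected margin equals the reference's, the loss is exactly log(2); it drops as the policy learns to prefer the chosen response more strongly than the reference does, and `beta` scales how hard it is pushed away from the reference.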
### KTO

Kahneman-Tversky Optimization draws on prospect theory: humans are more sensitive to losses
than to equivalent gains. KTO uses this insight to define a loss that works with single-label
feedback — each response is simply marked as desirable (`label=True`) or undesirable
(`label=False`).
This makes KTO well-suited when collecting binary user feedback (thumbs up/down, click-through rates) is easier than asking annotators to compare two responses head-to-head.
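The per-example loss can be sketched as follows. This is an illustration, not Ludwig's code: `z0` stands in for the KL-based reference point that the full method estimates across the batch, and the log-probability inputs are assumed to be per-sequence sums:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def kto_loss(policy_logp, ref_logp, desirable, z0=0.0, beta=0.1,
             desirable_weight=1.0, undesirable_weight=1.0):
    """Per-example KTO loss sketch with a fixed reference point z0."""
    r = policy_logp - ref_logp  # log ratio vs. the frozen reference model
    if desirable:
        # Loss shrinks as the policy raises a desirable response above z0.
        return desirable_weight * (1.0 - sigmoid(beta * (r - z0)))
    # Loss shrinks as the policy pushes an undesirable response below z0.
    return undesirable_weight * (1.0 - sigmoid(beta * (z0 - r)))
```

The `desirable_weight`/`undesirable_weight` hyperparameters from the table above simply rescale the two branches, which is useful when the positive and negative labels are imbalanced.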
### ORPO

Odds Ratio Preference Optimization combines supervised fine-tuning and preference alignment in
a single training objective. Unlike DPO, ORPO needs no reference-model forward pass, which
roughly halves GPU memory and compute. Use ORPO when you want to skip the separate SFT stage
and align in one shot.
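Concretely, ORPO adds an odds-ratio penalty to the ordinary SFT loss on the chosen response. The sketch below is an illustrative stand-in, not Ludwig's implementation; it assumes each input is a length-normalised sequence log-probability (so that `exp(logp)` is a probability strictly between 0 and 1):

```python
import math


def odds(logp):
    """Odds of a response given its length-normalised sequence log-probability."""
    p = math.exp(logp)
    return p / (1.0 - p)


def orpo_penalty(chosen_logp, rejected_logp):
    """Odds-ratio term added to the plain SFT loss: -log sigmoid(log odds ratio)."""
    log_or = math.log(odds(chosen_logp)) - math.log(odds(rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))
```

When both responses are equally likely the penalty is log(2); it decreases as the model's odds of the chosen response pull ahead of the rejected one, without ever consulting a second model.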
### GRPO

Group Relative Policy Optimization is a reinforcement learning trainer that generates multiple
completions per prompt, scores them with a reward function, and trains the policy using the
group-normalised advantage. Unlike DPO and KTO, GRPO does not require pre-collected human
annotations — the reward function can be entirely programmatic (e.g. pass/fail unit tests for
code generation, exact match for math).
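The two ingredients — a programmatic reward and group normalisation — can be sketched in a few lines. The function names below are hypothetical illustrations, not part of Ludwig's API:

```python
def exact_match_reward(completion, target):
    """Programmatic reward: 1.0 if the generated answer matches the target, else 0.0."""
    return 1.0 if completion.strip() == target.strip() else 0.0


def group_advantages(rewards):
    """Normalise rewards within one prompt's group of rollouts (mean 0, unit scale)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


# Score num_generations=4 rollouts for one math prompt whose answer is "42".
rewards = [exact_match_reward(c, "42") for c in ["42", "41", "42 ", "forty-two"]]
advantages = group_advantages(rewards)
```

Rollouts that beat their group's average get a positive advantage and are reinforced; because the advantage is relative within each group, no learned value function or reward model is needed.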
## Dataset preparation

### DPO and ORPO

Your dataset must contain `prompt`, `chosen`, and `rejected` columns:
```csv
prompt,chosen,rejected
"Explain gravity in simple terms","Gravity is a force that pulls objects with mass toward each other...","IDK lol"
"Write a haiku about autumn","Leaves drift silently / Crimson and gold fill the air / Winter waits below","autumn is nice i guess"
```
The Anthropic HH-RLHF dataset is a common starting point. The `prepare_dataset.py` script in
the `examples/alignment` directory downloads and converts it automatically:

```shell
python examples/alignment/prepare_dataset.py --output_dir data/ --max_train_samples 10000
```
To do the conversion manually:

```python
import re

import pandas as pd
from datasets import load_dataset


def last_human_turn(conv):
    """Extract the final Human turn from an HH-RLHF conversation string."""
    turns = re.findall(r"\n\nHuman: (.*?)(?=\n\nAssistant:|\Z)", conv, re.DOTALL)
    return turns[-1].strip() if turns else conv.strip()


def last_assistant_turn(conv):
    """Extract the final Assistant turn from an HH-RLHF conversation string."""
    turns = re.findall(r"\n\nAssistant: (.*?)(?=\n\nHuman:|\Z)", conv, re.DOTALL)
    return turns[-1].strip() if turns else ""


hh = load_dataset("Anthropic/hh-rlhf")
rows = []
for ex in hh["train"]:
    prompt = last_human_turn(ex["chosen"])
    chosen = last_assistant_turn(ex["chosen"])
    rejected = last_assistant_turn(ex["rejected"])
    # Keep only examples where all three fields were successfully extracted.
    if prompt and chosen and rejected:
        rows.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})

pd.DataFrame(rows).to_csv("train.csv", index=False)
```
### KTO

Expand each preference pair into two rows with a boolean label:

```python
# df is the prompt/chosen/rejected DataFrame built above
kto_rows = []
for _, row in df.iterrows():
    kto_rows.append({"prompt": row["prompt"], "response": row["chosen"], "label": True})
    kto_rows.append({"prompt": row["prompt"], "response": row["rejected"], "label": False})

pd.DataFrame(kto_rows).to_csv("train_kto.csv", index=False)
```
## Training

### DPO

```shell
ludwig train --config examples/alignment/config_dpo.yaml --dataset data/train.csv
```

Or via the Python API:

```python
import logging

import yaml

from ludwig.api import LudwigModel

config = yaml.safe_load(
    """
model_type: llm
base_model: meta-llama/Llama-3.1-8B
adapter:
  type: lora
  r: 16
  alpha: 32
  dropout: 0.05
trainer:
  type: dpo
  epochs: 1
  learning_rate: 5.0e-7
  batch_size: 2
  gradient_accumulation_steps: 8
  beta: 0.1
input_features:
  - name: prompt
    type: text
output_features:
  - name: chosen
    type: text
backend:
  type: local
"""
)

model = LudwigModel(config=config, logging_level=logging.INFO)
train_stats, _, output_dir = model.train(dataset="data/train.csv")
```
### KTO

```shell
ludwig train --config examples/alignment/config_kto.yaml --dataset data/train_kto.csv
```

In the Python API, change the trainer block and the output feature name:

```python
config["trainer"] = {
    "type": "kto",
    "epochs": 1,
    "learning_rate": 5e-7,
    "batch_size": 2,
    "gradient_accumulation_steps": 8,
    "beta": 0.1,
    "desirable_weight": 1.0,
    "undesirable_weight": 1.0,
}
config["output_features"] = [{"name": "response", "type": "text"}]
```
### ORPO

```shell
ludwig train --config examples/alignment/config_orpo.yaml --dataset data/train.csv
```

ORPO uses the same `prompt`/`chosen`/`rejected` columns as DPO. The trainer merges the SFT and
alignment losses, so no reference model is needed — this saves roughly half the GPU memory
compared to DPO.
## Upload to HuggingFace Hub

After training, share the aligned model:

```shell
ludwig upload hf_hub -r <your_org>/<model_name> -m results/experiment_run/model
```

Or from Python:

```python
from ludwig.api import LudwigModel

LudwigModel.upload_to_hf_hub("your_org/model_name", "results/experiment_run/model")
```
## See also
- Fine-tuning guide — standard SFT, LoRA, and quantization options
- LLM configuration reference — full config schema
- Llama-2 fine-tuning example — QLoRA instruction tuning