# LLM Alignment with Preference Learning

Alignment training adapts a language model so its outputs match human values and preferences.
The classic RLHF pipeline — collect human rankings, train a reward model, run PPO — is
expensive and notoriously unstable. Ludwig provides a family of modern preference learning
trainers that skip the reward model entirely and run with the same `ludwig train` command used
for standard fine-tuning.
This page covers:
- When to use each alignment method
- How to prepare datasets for each trainer
- CLI and Python API examples
- Uploading the aligned model to HuggingFace Hub
The full runnable notebook is in `examples/alignment/alignment_dpo.ipynb`.
## Alignment trainers

Ludwig supports four preference learning trainers, all selected via `trainer.type`:
| Trainer | Paper | Data format | Key hyperparameter |
|---|---|---|---|
| `dpo` | Rafailov et al., 2023 | `prompt`, `chosen`, `rejected` | `beta` — KL penalty (default 0.1) |
| `kto` | Ethayarajh et al., 2024 | `prompt`, `response`, `label` (bool) | `desirable_weight`, `undesirable_weight` |
| `orpo` | Hong et al., 2024 | `prompt`, `chosen`, `rejected` | — (no reference model) |
| `grpo` | Shao et al., 2024 | `prompt` + reward function | `num_generations` — rollouts per prompt |
### DPO

Direct Preference Optimization reformulates the RLHF objective so that the policy is the only
model being trained — no separate reward model is required. The loss raises the log-probability
of the chosen response relative to the rejected response, subject to a KL-divergence penalty
controlled by `beta`.
DPO is the most widely studied alignment method and a natural first choice when you have paired preference data.
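The shape of the objective is easy to see in isolation. The sketch below is not Ludwig's implementation — it is a minimal per-pair DPO loss in plain Python, assuming the summed log-probabilities of each response under the policy and the frozen reference model are already available:

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))
```

When the policy's chosen-over-rejected margin equals the reference's, the loss is exactly log(2); it drops as the policy learns to prefer the chosen response more strongly than the reference does, and `beta` scales how hard it is pushed away from the reference.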
### KTO

Kahneman-Tversky Optimization draws on prospect theory: humans are more sensitive to losses
than to equivalent gains. KTO uses this insight to define a loss that works with single-label
feedback — each response is simply marked as desirable (`label=True`) or undesirable
(`label=False`).
This makes KTO well-suited when collecting binary user feedback (thumbs up/down, click-through rates) is easier than asking annotators to compare two responses head-to-head.
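The per-example loss can be sketched as follows. This is an illustration, not Ludwig's code: `z0` stands in for the KL-based reference point that the full method estimates across the batch, and the log-probability inputs are assumed to be per-sequence sums:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def kto_loss(policy_logp, ref_logp, desirable, z0=0.0, beta=0.1,
             desirable_weight=1.0, undesirable_weight=1.0):
    """Per-example KTO loss sketch with a fixed reference point z0."""
    r = policy_logp - ref_logp  # log ratio vs. the frozen reference model
    if desirable:
        # Loss shrinks as the policy raises a desirable response above z0.
        return desirable_weight * (1.0 - sigmoid(beta * (r - z0)))
    # Loss shrinks as the policy pushes an undesirable response below z0.
    return undesirable_weight * (1.0 - sigmoid(beta * (z0 - r)))
```

The `desirable_weight`/`undesirable_weight` hyperparameters from the table above simply rescale the two branches, which is useful when the positive and negative labels are imbalanced.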
### ORPO

Odds Ratio Preference Optimization combines supervised fine-tuning and preference alignment in
a single training objective. Unlike DPO, ORPO needs no reference-model forward pass, which
roughly halves GPU memory and compute. Use ORPO when you want to skip the separate SFT stage
and align in one shot.
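Concretely, ORPO adds an odds-ratio penalty to the ordinary SFT loss on the chosen response. The sketch below is an illustrative stand-in, not Ludwig's implementation; it assumes each input is a length-normalised sequence log-probability (so that `exp(logp)` is a probability strictly between 0 and 1):

```python
import math


def odds(logp):
    """Odds of a response given its length-normalised sequence log-probability."""
    p = math.exp(logp)
    return p / (1.0 - p)


def orpo_penalty(chosen_logp, rejected_logp):
    """Odds-ratio term added to the plain SFT loss: -log sigmoid(log odds ratio)."""
    log_or = math.log(odds(chosen_logp)) - math.log(odds(rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))
```

When both responses are equally likely the penalty is log(2); it decreases as the model's odds of the chosen response pull ahead of the rejected one, without ever consulting a second model.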
### GRPO

Group Relative Policy Optimization is a reinforcement learning trainer that generates multiple
completions per prompt, scores them with a reward function, and trains the policy using the
group-normalised advantage. Unlike DPO and KTO, GRPO does not require pre-collected human
annotations — the reward function can be entirely programmatic (e.g. pass/fail unit tests for
code generation, exact match for math).
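The two ingredients — a programmatic reward and group normalisation — can be sketched in a few lines. The function names below are hypothetical illustrations, not part of Ludwig's API:

```python
def exact_match_reward(completion, target):
    """Programmatic reward: 1.0 if the generated answer matches the target, else 0.0."""
    return 1.0 if completion.strip() == target.strip() else 0.0


def group_advantages(rewards):
    """Normalise rewards within one prompt's group of rollouts (mean 0, unit scale)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


# Score num_generations=4 rollouts for one math prompt whose answer is "42".
rewards = [exact_match_reward(c, "42") for c in ["42", "41", "42 ", "forty-two"]]
advantages = group_advantages(rewards)
```

Rollouts that beat their group's average get a positive advantage and are reinforced; because the advantage is relative within each group, no learned value function or reward model is needed.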
## Dataset preparation

### DPO and ORPO

Your dataset must contain `prompt`, `chosen`, and `rejected` columns:
```csv
prompt,chosen,rejected
"Explain gravity in simple terms","Gravity is a force that pulls objects with mass toward each other...","IDK lol"
"Write a haiku about autumn","Leaves drift silently / Crimson and gold fill the air / Winter waits below","autumn is nice i guess"
```
The Anthropic HH-RLHF dataset is a common starting point. The `prepare_dataset.py` script in
the `examples/alignment` directory downloads and converts it automatically:

```shell
python examples/alignment/prepare_dataset.py --output_dir data/ --max_train_samples 10000
```
To do the conversion manually:

```python
import re

import pandas as pd
from datasets import load_dataset


def last_human_turn(conv):
    """Extract the final Human turn from an HH-RLHF conversation string."""
    turns = re.findall(r"\n\nHuman: (.*?)(?=\n\nAssistant:|\Z)", conv, re.DOTALL)
    return turns[-1].strip() if turns else conv.strip()


def last_assistant_turn(conv):
    """Extract the final Assistant turn from an HH-RLHF conversation string."""
    turns = re.findall(r"\n\nAssistant: (.*?)(?=\n\nHuman:|\Z)", conv, re.DOTALL)
    return turns[-1].strip() if turns else ""


hh = load_dataset("Anthropic/hh-rlhf")
rows = []
for ex in hh["train"]:
    prompt = last_human_turn(ex["chosen"])
    chosen = last_assistant_turn(ex["chosen"])
    rejected = last_assistant_turn(ex["rejected"])
    # Keep only examples where all three fields were successfully extracted.
    if prompt and chosen and rejected:
        rows.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})

pd.DataFrame(rows).to_csv("train.csv", index=False)
```
### KTO

Expand each preference pair into two rows with a boolean label:

```python
# df is the prompt/chosen/rejected DataFrame built above
kto_rows = []
for _, row in df.iterrows():
    kto_rows.append({"prompt": row["prompt"], "response": row["chosen"], "label": True})
    kto_rows.append({"prompt": row["prompt"], "response": row["rejected"], "label": False})

pd.DataFrame(kto_rows).to_csv("train_kto.csv", index=False)
```
## Training

### DPO

```shell
ludwig train --config examples/alignment/config_dpo.yaml --dataset data/train.csv
```

Or via the Python API:

```python
import logging

import yaml

from ludwig.api import LudwigModel

config = yaml.safe_load(
    """
model_type: llm
base_model: meta-llama/Llama-3.1-8B
adapter:
  type: lora
  r: 16
  alpha: 32
  dropout: 0.05
trainer:
  type: dpo
  epochs: 1
  learning_rate: 5.0e-7
  batch_size: 2
  gradient_accumulation_steps: 8
  beta: 0.1
input_features:
  - name: prompt
    type: text
output_features:
  - name: chosen
    type: text
backend:
  type: local
"""
)

model = LudwigModel(config=config, logging_level=logging.INFO)
train_stats, _, output_dir = model.train(dataset="data/train.csv")
```
### KTO

```shell
ludwig train --config examples/alignment/config_kto.yaml --dataset data/train_kto.csv
```

In the Python API, change the trainer block and the output feature name:

```python
config["trainer"] = {
    "type": "kto",
    "epochs": 1,
    "learning_rate": 5e-7,
    "batch_size": 2,
    "gradient_accumulation_steps": 8,
    "beta": 0.1,
    "desirable_weight": 1.0,
    "undesirable_weight": 1.0,
}
config["output_features"] = [{"name": "response", "type": "text"}]
```
### ORPO

```shell
ludwig train --config examples/alignment/config_orpo.yaml --dataset data/train.csv
```

ORPO uses the same `prompt`/`chosen`/`rejected` columns as DPO. The trainer merges the SFT and
alignment losses, so no reference model is needed — this saves roughly half the GPU memory
compared to DPO.
## Upload to HuggingFace Hub

After training, share the aligned model:

```shell
ludwig upload hf_hub -r <your_org>/<model_name> -m results/experiment_run/model
```

Or from Python:

```python
from ludwig.api import LudwigModel

LudwigModel.upload_to_hf_hub("your_org/model_name", "results/experiment_run/model")
```
## See also
- Fine-tuning guide — standard SFT, LoRA, and quantization options
- LLM configuration reference — full config schema
- Llama-2 fine-tuning example — QLoRA instruction tuning