Advanced PEFT Adapters¶
Ludwig's PEFT integration (backed by HuggingFace PEFT) goes well beyond standard LoRA. PR #4146 adds new LoRA initializers that improve convergence and final quality, per-module rank/alpha overrides, and several new adapter families covering orthogonal, wavelet-based, and layer-norm-only tuning strategies.
This page collects config snippets and short explanations for each advanced option. For the full parameter reference see the LLM configuration docs.
LoRA initializers¶
Standard LoRA initializes B = 0 (with A random), so the adapter is a no-op at the start of training. The initializers
below instead derive the starting factors from the pretrained weights or from calibration data, which can speed up convergence and often improves the final metric.
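To make the no-op claim concrete, here is a minimal sketch in plain Python. It assumes the usual LoRA formulation, where the effective weight is W + (alpha / r) · B · A; the shapes and helper names are hypothetical and chosen only for illustration.

```python
import random

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]

def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight the adapted layer actually uses."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 2  # toy sizes, not taken from any real model

w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
a = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # A: random init
b = [[0.0] * r for _ in range(d_out)]                                  # B: zero init

w_eff = lora_effective_weight(w, a, b, alpha, r)
print(w_eff == w)  # True: with B = 0 the adapter contributes nothing
```

Because B · A is exactly zero at step 0, the first updates have to move B away from zero before the adapter changes the model's output at all.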
PiSSA¶
Principal Singular Values and Singular Vectors Adaptation (PiSSA) initializes the adapter from the
top-r singular components of each pretrained weight matrix and keeps the residual frozen. Its authors
report better results than standard LoRA at the same rank, and it requires no extra data.
model_type: llm
base_model: meta-llama/Llama-3.1-8B
adapter:
  type: lora
  r: 16
  alpha: 16
  init_lora_weights: pissa
trainer:
  type: finetune
  epochs: 3
  learning_rate: 1e-4
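The decomposition behind PiSSA can be sketched with NumPy. This is an illustration of the idea rather than the PEFT implementation: it assumes alpha equals r (so the LoRA scale is 1), and the function name `pissa_init` is hypothetical.

```python
import numpy as np

def pissa_init(w, r):
    """Split w into a frozen residual plus trainable factors from its top-r SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])
    b = u[:, :r] * sqrt_s          # (d_out, r): U_r scaled by sqrt of singular values
    a = sqrt_s[:, None] * vt[:r]   # (r, d_in):  sqrt of singular values times V_r^T
    w_res = w - b @ a              # residual that stays frozen during training
    return w_res, a, b

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 6))        # toy stand-in for a pretrained weight matrix
w_res, a, b = pissa_init(w, r=2)

# The split is exact: residual plus adapter reproduces the original weight.
print(np.allclose(w_res + b @ a, w))
# The adapter holds exactly the top-2 singular directions.
print(np.linalg.matrix_rank(b @ a))
```

Unlike the zero-initialized case, the adapter here starts out carrying the largest-magnitude directions of the weight, while the model's output at step 0 is still unchanged.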
CorDA¶
Context-Oriented Decomposition Adaptation (CorDA) initializes the subspace from the covariance of input activations computed on a small calibration batch. It is most effective when a representative in-domain sample is available at initialization time.
adapter:
  type: lora
  r: 16
  alpha: 16
  init_lora_weights: corda
Layer-Norm tuning¶
Layer-Norm tuning trains only the weight and bias parameters of LayerNorm/RMSNorm layers. It is the
lightest adapter type available — often fewer than 0.1% of backbone parameters — and works well for
domain adaptation of already-instruction-tuned models where the knowledge is largely intact and only the
output distribution needs to shift.
model_type: llm
base_model: mistralai/Mistral-7B-Instruct-v0.3
input_features:
  - name: prompt
    type: text
output_features:
  - name: response
    type: text
adapter:
  type: ln_tuning
trainer:
  type: finetune
  epochs: 2
  learning_rate: 5e-4
  batch_size: 4
  gradient_accumulation_steps: 8
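A back-of-the-envelope count shows how small the trainable set is. The figures below are assumptions for a typical 7B decoder (hidden size 4096, 32 layers, two bias-free RMSNorm layers per block plus one final norm, about 7.2e9 total parameters); they are approximations, not values read from a checkpoint.

```python
def norm_param_count(hidden_size, num_layers, norms_per_layer=2, has_bias=False):
    """Count parameters in all per-block norm layers plus one final norm."""
    per_norm = hidden_size * (2 if has_bias else 1)  # weight, plus bias if present
    return (num_layers * norms_per_layer + 1) * per_norm

# Assumed architecture figures (approximate, for illustration only).
trainable = norm_param_count(hidden_size=4096, num_layers=32)
total = 7.2e9

print(trainable)                          # 266240
print(f"{100 * trainable / total:.4f}%")  # 0.0037%
```

Under these assumptions the trainable share sits well below the 0.1% mark.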
Orthogonal adapters¶
OFT¶
Orthogonal Fine-Tuning (OFT) constrains weight updates to orthogonal transformations of the pretrained weights, preserving the hyperspherical geometry of the pretrained representations. Because an orthogonal transform leaves the pairwise angles between neuron weight vectors unchanged, the learned similarity structure stays stable during fine-tuning, which makes OFT particularly effective for tasks that depend on semantic similarity.
adapter:
  type: oft
  r: 8
  module_dropout: 0.0
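The angle-preservation property can be checked on a toy example. The sketch below assumes a block-diagonal orthogonal matrix built from 2x2 Cayley transforms, (I + S)(I - S)^-1 with S skew-symmetric; the block size, parameter values, and helper names are hypothetical, and the real implementation may parameterize its blocks differently.

```python
import math

def cayley_block(a):
    """2x2 orthogonal block (I + S)(I - S)^-1 for skew-symmetric S = [[0, -a], [a, 0]]."""
    c = (1 - a * a) / (1 + a * a)
    s = 2 * a / (1 + a * a)
    return [[c, -s], [s, c]]

def apply_block_diagonal(blocks, w):
    """Left-multiply w by diag(blocks); rows of w are grouped in consecutive pairs."""
    out = []
    for i, blk in enumerate(blocks):
        top, bottom = w[2 * i], w[2 * i + 1]
        out.append([blk[0][0] * t + blk[0][1] * b for t, b in zip(top, bottom)])
        out.append([blk[1][0] * t + blk[1][1] * b for t, b in zip(top, bottom)])
    return out

def column(w, j):
    return [row[j] for row in w]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Toy 4 x 3 weight: each column plays the role of one neuron's weight vector.
w = [[1.0, 0.5, -0.3], [0.2, -1.0, 0.8], [0.7, 0.1, 0.4], [-0.6, 0.9, 0.2]]
blocks = [cayley_block(0.3), cayley_block(-0.8)]  # a = 0 would give the identity
w_new = apply_block_diagonal(blocks, w)

before = cosine(column(w, 0), column(w, 1))
after = cosine(column(w_new, 0), column(w_new, 1))
print(abs(before - after) < 1e-12)  # True: the angle between the two neurons is unchanged
```

The weights themselves change, but every pairwise cosine between columns is preserved, which is the geometric constraint the adapter enforces.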
HRA¶
Householder Reflection Adaptation parameterizes updates as a product of Householder reflections. It achieves a similar orthogonality guarantee to OFT but with fewer parameters per layer.
adapter:
  type: hra
  r: 8
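As a minimal sketch of the construction, the snippet below multiplies r Householder reflections, each of the form H = I - 2 u uᵀ / (uᵀ u), and checks that the product is orthogonal. The vectors and dimensions are hypothetical toy values.

```python
def householder(u):
    """Return H = I - 2 u u^T / (u^T u) as a list of rows."""
    norm_sq = sum(x * x for x in u)
    n = len(u)
    return [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] / norm_sq for j in range(n)]
            for i in range(n)]

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

# r = 2 reflections in dimension 3: each reflection costs one length-3 vector.
vectors = [[1.0, 2.0, -1.0], [0.5, -0.5, 2.0]]
q = householder(vectors[0])
for u in vectors[1:]:
    q = matmul(q, householder(u))

# Q^T Q should be the identity if the product is orthogonal.
gram = matmul(transpose(q), q)
max_err = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
              for i in range(3) for j in range(3))
print(max_err < 1e-12)  # True
```

In this toy case the product is described by r × d = 6 numbers, compared with 9 entries for a dense 3 x 3 matrix; the gap widens quickly as the dimension grows.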
Both OFT and HRA are drop-in replacements for LoRA in any Ludwig LLM config — just change the type
field on adapter.
Wavelet-based tuning¶
WaveFT¶
WaveFT applies updates in the wavelet domain, concentrating the parameter budget on the frequency bands most perturbed during fine-tuning. It is especially useful for models that process structured signals (audio, images encoded as tokens) where frequency structure carries semantic meaning.
adapter:
  type: waveft
  r: 8
  alpha: 16
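To illustrate what an update "in the wavelet domain" means, the following sketch stores a few trainable coefficients in a sparse matrix and maps them back to a weight update with an inverse transform. It assumes a single-level orthonormal Haar transform and hand-picked coefficient positions purely for simplicity; the wavelet family and the way positions are selected in the actual adapter may differ.

```python
import math

def haar_matrix(n):
    """Single-level orthonormal Haar analysis matrix for even n."""
    s = 1.0 / math.sqrt(2.0)
    h = [[0.0] * n for _ in range(n)]
    for i in range(n // 2):
        h[i][2 * i], h[i][2 * i + 1] = s, s                     # local averages
        h[n // 2 + i][2 * i], h[n // 2 + i][2 * i + 1] = s, -s  # local differences
    return h

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

n = 4
h = haar_matrix(n)

# Sparse coefficient matrix: only these two entries would be trained (hypothetical values).
coeffs = [[0.0] * n for _ in range(n)]
coeffs[0][2] = 0.5
coeffs[3][1] = -0.25

# Inverse transform: delta_W = H^T C H (H is orthogonal, so its inverse is H^T).
delta_w = matmul(matmul(transpose(h), coeffs), h)

trainable = sum(1 for row in coeffs for c in row if c != 0.0)
touched = sum(1 for row in delta_w for v in row if abs(v) > 1e-12)
print(trainable, touched)
```

Each wavelet coefficient spreads over a small patch of the weight matrix, so a handful of trained values can influence more entries than the same number of directly trained weights.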
Vector-bank adapters¶
VBLoRA¶
Vector-Bank LoRA (VB-LoRA) builds the low-rank matrices of every layer from a shared global bank of vectors. Each
layer selects a small subset of these vectors and linearly combines them to form its factors. When many layers learn similar update
directions, this can reduce the total parameter count significantly compared with standard LoRA.
adapter:
  type: vblora
  r: 4
  num_vectors: 256
  vector_length: 256
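The select-and-combine step can be sketched as a top-k admixture: each sub-vector of a low-rank factor is a softmax-weighted blend of the k bank vectors with the highest logits. The sizes, the value of k, and the function name below are assumptions for illustration, not settings taken from Ludwig or PEFT.

```python
import math
import random

def topk_admixture(logits, bank, k):
    """Blend the k bank vectors with the largest logits, weighted by a softmax over those k."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]  # shift by max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    length = len(bank[0])
    return [sum(w * bank[i][d] for w, i in zip(weights, top)) for d in range(length)]

random.seed(0)
num_vectors, vector_length, k = 8, 4, 2  # toy sizes
bank = [[random.gauss(0, 1) for _ in range(vector_length)] for _ in range(num_vectors)]

# One factor of shape (r, d_in) is assembled from d_in / vector_length pieces per row.
r, d_in = 2, 8
pieces_per_row = d_in // vector_length
logits = [[[random.gauss(0, 1) for _ in range(num_vectors)]
           for _ in range(pieces_per_row)] for _ in range(r)]

a = [[x for piece in row for x in topk_admixture(piece, bank, k)] for row in logits]
print(len(a), len(a[0]))  # 2 8
```

Only the bank and the per-piece logits are trained, and the bank is shared by every layer. This sharing also explains why the vector length has to divide the dimensions of the matrices being assembled.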
Comparison table¶
| Adapter | Extra params (approx.) | Key strength | Best for |
|---|---|---|---|
| lora (default init) | ~0.1–1% | Versatile, well-studied | General fine-tuning baseline |
| lora + pissa | ~0.1–1% | Better initialization, faster convergence | When standard LoRA underfits |
| lora + corda | ~0.1–1% | Data-driven subspace alignment | In-domain adaptation with calibration data |
| lora + loftq | ~0.1–1% | Minimises quantization error at init | 4-bit QLoRA fine-tuning |
| ln_tuning | <0.1% | Extremely lightweight | Domain shift with instruction-tuned base models |
| oft | ~0.5–2% | Preserves hyperspherical geometry | Semantic similarity, generation fidelity |
| hra | ~0.3–1% | Orthogonal, fewer params than OFT | Same as OFT with tighter parameter budget |
| waveft | ~0.5–2% | Frequency-domain concentration | Audio/vision token models |
| vblora | ~0.05–0.5% | Shared vector bank across layers | Very low parameter budgets |
| c3a | ~0.2–1% | Block-sparse updates | Sparse activation models |
| tinylora | ~0.05–0.5% | Learned rank allocation | Strict parameter count constraints |