Encoder Selection Guide
Ludwig provides a wide catalog of encoders across all feature types. This guide helps you choose the right encoder for your task based on your data, accuracy requirements, and computational constraints.
Text Encoders
Text encoders process natural language input. Choose based on your document length, accuracy needs, and latency budget.
| Encoder | Best For | Speed | Max Tokens | Key Feature |
|---|---|---|---|---|
| modernbert | General NLU, classification | Fast | 8192 | SOTA encoder-only, Flash Attention + RoPE |
| bert | General NLU baseline | Medium | 512 | Most widely studied, huge ecosystem |
| roberta | NLU when more data available | Medium | 512 | Better pretraining than BERT |
| distilbert | Speed-sensitive deployment | Fast | 512 | About 97% of BERT quality, about 60% faster |
| deberta | Max benchmark accuracy | Slow | 512 | SOTA on SuperGLUE/SQuAD |
| longformer | Long documents | Medium | 4096 | Sliding window + global attention |
| auto_transformer | Any HF model | Varies | Varies | Use for GTE, BGE, E5, Nomic, etc. |
| tf_idf | Interpretability, tiny data | Very Fast | N/A | Non-neural baseline |
Text Encoder Decision Tree
Text encoder selection:
Need >512 tokens? --> longformer or modernbert
Need fastest option? --> distilbert or tf_idf
Need max accuracy? --> deberta
Need custom HF model? --> auto_transformer
Default choice --> modernbert or bert
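The decision tree can be read as an ordered list of rules. The sketch below assumes the rules are checked top to bottom and the first match wins; the function name and its parameters are hypothetical, and the 4096-token cutoff for longformer is taken from the table above rather than from the tree itself.

```python
def pick_text_encoder(max_tokens=512, need_fastest=False,
                      need_max_accuracy=False, custom_hf_model=False):
    """Return candidate encoder names; rules are checked top to bottom."""
    if max_tokens > 4096:
        # Only modernbert is listed with a limit above 4096 tokens.
        return ["modernbert"]
    if max_tokens > 512:
        return ["longformer", "modernbert"]
    if need_fastest:
        return ["distilbert", "tf_idf"]
    if need_max_accuracy:
        return ["deberta"]
    if custom_hf_model:
        return ["auto_transformer"]
    # No special requirement: fall back to the default choice.
    return ["modernbert", "bert"]


print(pick_text_encoder(max_tokens=2048))
print(pick_text_encoder(need_fastest=True))
print(pick_text_encoder())
```

In practice several requirements can apply at once, so treat the result as a shortlist to evaluate rather than a final answer.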
Tips
- modernbert is the recommended default for most NLU tasks. It supports long contexts up to 8192 tokens and uses Flash Attention for efficient training.
- auto_transformer is the escape hatch -- use it to load any model from the Hugging Face Hub (e.g., `sentence-transformers/all-MiniLM-L6-v2`, `BAAI/bge-large-en-v1.5`).
- tf_idf is useful as a non-neural baseline for comparison, or when you need fully interpretable features on very small datasets.
- For classification tasks, set `trainable: false` to use the pretrained encoder as a frozen feature extractor, which trains faster and reduces the risk of overfitting on small data.
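As an illustration of the frozen-encoder tip, here is a minimal sketch of a config expressed as a Python dictionary. The column names `review` and `sentiment` are hypothetical, and the layout (an `input_features` list with a nested `encoder` block) is assumed to follow Ludwig's declarative config schema; check the configuration reference for your Ludwig version before relying on it.

```python
import json

# Hypothetical dataset columns: "review" (text) and "sentiment" (label).
config = {
    "input_features": [
        {
            "name": "review",
            "type": "text",
            "encoder": {
                "type": "bert",
                # Freeze the pretrained weights so only the layers after
                # the encoder are trained.
                "trainable": False,
            },
        }
    ],
    "output_features": [
        {"name": "sentiment", "type": "category"},
    ],
}

# Print the config for inspection; the same structure can be written as YAML.
print(json.dumps(config, indent=2))
```

Switching `trainable` back to `True` fine-tunes the encoder, which is usually worth trying once you have enough labeled data.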
Image Encoders
Image encoders process visual input. Pretrained encoders generally outperform training from scratch unless you have a very specialized domain.
| Encoder | Best For | Pretrained | Key Feature |
|---|---|---|---|
| resnet | General classification baseline | ImageNet | Well-understood, fast |
| efficientnet | Mobile/efficiency | ImageNet | Best accuracy/FLOP trade-off |
| vit | High accuracy | ImageNet | Vision Transformer |
| swin_transformer | Hierarchical features | ImageNet | Shifted windows, multi-scale |
| clip | Zero-shot, multimodal | CLIP data | Text-aligned visual features |
| dinov2 | Frozen feature extraction | Self-supervised | No labels needed for pretraining |
| siglip | Scaled multimodal | SigLIP data | Better scaling than CLIP |
| convnextv2 | Pure-CNN, high accuracy | FCMAE | GRN + masked autoencoder pretraining |
| timm | Any TIMM model | Various | Access to 700+ architectures |
| stacked_cnn | Small data, custom arch | None | Fully configurable CNN stack |
Image Encoder Decision Tree
Image encoder selection:
Need multimodal (image + text)? --> clip or siglip
Need frozen features, no labels? --> dinov2
Need max accuracy? --> vit or swin_transformer
Need efficiency / mobile? --> efficientnet
Need pure-CNN, no attention? --> convnextv2
Need a specific TIMM arch? --> timm
Training from scratch? --> stacked_cnn or resnet
Default choice --> resnet or vit
Tips
- timm gives access to over 700 architectures from the PyTorch Image Models library. Use it when you need a specific architecture not listed above (e.g., ConvNeXt, EfficientNetV2, MaxViT).
- clip and siglip produce embeddings aligned with text, making them good choices when you combine image and text features in a multimodal model.
- stacked_cnn trains from scratch with no pretrained weights -- use it only when your images are very different from natural images (e.g., medical scans, spectrograms) and you have enough data.
Audio Encoders
Audio encoders process sound input, either as raw waveforms or spectrograms.
| Encoder | Best For | Input | Key Feature |
|---|---|---|---|
| wav2vec2 | Speech tasks | Raw waveform | Self-supervised, English-first |
| whisper | Multilingual, noisy audio | Mel spectrogram | 680K hours training data |
| hubert | Speaker/emotion tasks | Raw waveform | Masked prediction pretraining |
| parallel_cnn | Simple classification | Spectrogram | Fast, no pretrained weights needed |
Audio Encoder Decision Tree
Audio encoder selection:
Multilingual or noisy audio? --> whisper
Speaker ID or emotion? --> hubert
English speech tasks? --> wav2vec2
Simple audio classification? --> parallel_cnn
Default choice --> whisper
Tips
- whisper is the most robust choice for general audio tasks. It was trained on 680,000 hours of multilingual data and handles noise well.
- wav2vec2 and hubert work directly on raw waveforms -- no spectrogram preprocessing needed.
- parallel_cnn is a lightweight option that trains from scratch. Use it when pretrained speech models are overkill (e.g., simple environmental sound classification).
Sequence Encoders
Sequence encoders process ordered sequences of tokens (used by sequence, text, and time series features). Choose based on sequence length and quality requirements.
| Encoder | Best For | Complexity | Key Feature |
|---|---|---|---|
| stacked_transformer | Medium sequences | O(n^2) | Full attention, best quality |
| mamba | Long sequences | O(n) | Linear-time SSM |
| rnn / stacked_rnn | Short sequences, low memory | O(n) | Sequential processing |
| parallel_cnn | Local patterns | O(n) | Parallel convolutions, fast |
| stacked_cnn | Hierarchical local patterns | O(n) | Deep convolutional stack |
Sequence Encoder Decision Tree
Sequence encoder selection:
Very long sequences (>1000 tokens)? --> mamba
Need best quality, moderate length? --> stacked_transformer
Need speed, local patterns only? --> parallel_cnn or stacked_cnn
Low memory budget? --> rnn
Default choice --> stacked_transformer
Tips
- mamba uses a State Space Model (SSM) architecture that scales linearly with sequence length, making it ideal for very long sequences where transformer attention would be too expensive.
- parallel_cnn applies multiple convolution filters in parallel and is the fastest option. It works well when the signal is local (e.g., n-gram patterns).
- stacked_rnn supports LSTM and GRU cells. It is memory-efficient but slow to train due to sequential processing.
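To make the O(n^2) versus O(n) distinction concrete, the short calculation below counts the token-pair interactions that full self-attention scores against the per-token steps of a linear-time encoder. It is back-of-the-envelope arithmetic only: constant factors, hidden sizes, and hardware effects are ignored, and the function names are illustrative.

```python
def attention_pairs(n):
    """Token-pair scores computed by full self-attention: n * n."""
    return n * n


def linear_steps(n):
    """Per-token steps for a linear-time encoder (SSM, RNN, CNN): n."""
    return n


for n in (128, 1024, 8192):
    ratio = attention_pairs(n) // linear_steps(n)
    print(f"n={n:>5}  quadratic={attention_pairs(n):>12,}  "
          f"linear={linear_steps(n):>6,}  ratio={ratio}x")
```

The ratio grows with the sequence length itself, which is why the gap matters little for short sequences and dominates for sequences of several thousand tokens.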
Number Encoders
Number encoders transform scalar numerical inputs into dense representations for the combiner.
| Encoder | Best For | Key Feature |
|---|---|---|
| passthrough | Pre-normalized data | No transformation |
| dense | Simple learned projection | Single FC layer |
| ple | Tabular SOTA | Piecewise linear, quantile bins |
| periodic | Smooth functions | Learned sinusoidal features |
| bins | Simple discretization | Bin + embedding, fast |
Number Encoder Decision Tree
Number encoder selection:
Data already normalized? --> passthrough
Tabular competition / max accuracy? --> ple
Periodic or cyclical patterns? --> periodic
Simple baseline? --> dense
Default choice --> dense or passthrough
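Tips
- ple (Piecewise Linear Encoding) splits the value range into quantile-based bins and encodes each number by its position across those bins, interpolating linearly inside the bin it falls in. It is among the strongest numerical encodings reported for deep learning on tabular data.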
- periodic learns sinusoidal representations and works well when the underlying function is smooth.
- passthrough simply forwards the raw number -- use it when your data is already normalized and you want the combiner to handle feature interactions directly.
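The idea behind piecewise linear encoding can be sketched in a few lines of plain Python. This follows the general technique (one output component per quantile bin) and is not Ludwig's implementation; the helper names are hypothetical, and values outside the training range are simply clamped here.

```python
def quantile_edges(values, n_bins):
    """Bin edges at evenly spaced quantiles of the training values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / n_bins)] for i in range(n_bins + 1)]


def ple_encode(x, edges):
    """One component per bin: 1.0 if x lies past the bin, 0.0 if x lies
    before it, and a linear fraction if x falls inside it."""
    encoded = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if x >= hi:
            encoded.append(1.0)
        elif x <= lo:
            encoded.append(0.0)
        else:
            encoded.append((x - lo) / (hi - lo))
    return encoded


edges = quantile_edges(list(range(101)), n_bins=4)
print(edges)                  # [0, 25, 50, 75, 100]
print(ple_encode(60, edges))  # [1.0, 1.0, 0.4, 0.0]
```

Unlike plain binning, nearby values inside the same bin still get different encodings, so ordering information within a bin is preserved.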
Category Encoders
Category encoders transform categorical variables into dense or sparse representations.
| Encoder | Best For | Key Feature |
|---|---|---|
| dense | Default, medium cardinality | Learned embedding |
| sparse | Large vocab, few active | Sparse embedding lookup |
| onehot | Small vocab, tree models | One-hot vector |
| target | High cardinality | Mean target encoding |
| hash | Extreme cardinality, streaming | Feature hashing, fixed memory |
Category Encoder Decision Tree
Category encoder selection:
Very high cardinality (>10K)? --> hash or target
Small vocab (<20 categories)? --> onehot or dense
Need fixed memory footprint? --> hash
Default choice --> dense
Tips
- dense is the default and works well for most cases. It learns an embedding vector for each category.
- hash uses feature hashing to map categories to a fixed-size vector. It handles unseen categories gracefully and uses constant memory regardless of vocabulary size.
- target encoding replaces each category with the mean of the target variable. It is effective for high-cardinality features but requires care to avoid target leakage (Ludwig handles this internally with cross-fitting).
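Feature hashing, as described for the hash encoder, can be illustrated with a stable hash and a modulo. This is a generic sketch of the technique, not Ludwig's internal code; `hash_bucket` and `NUM_BUCKETS` are illustrative names. It uses `hashlib` because Python's built-in `hash()` for strings is salted per process and would not give reproducible buckets.

```python
import hashlib


def hash_bucket(category, num_buckets):
    """Map a category string to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


NUM_BUCKETS = 16  # memory is fixed by this number, not by the vocabulary size

for value in ["user_123", "user_456", "never_seen_before"]:
    # Unseen categories need no vocabulary update: they hash like any other.
    print(value, "->", hash_bucket(value, NUM_BUCKETS))
```

The trade-off is collisions: two different categories can land in the same bucket and share an embedding, so the bucket count should be sized with the expected cardinality in mind.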
Date Encoders
Date encoders extract temporal features from date/datetime inputs.
| Encoder | Best For | Key Feature |
|---|---|---|
| DateEmbed | General date features | Embeds each date component (year, month, day, etc.) separately |
| DateWave | Cyclical patterns | Uses sinusoidal encoding for cyclical components (month, day of week) |
- DateEmbed is the default. It creates separate embeddings for year, month, day, weekday, hour, etc. and concatenates them.
- DateWave uses sine/cosine transformations for cyclical date components, which helps the model understand that December and January are adjacent months.
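The December/January point can be checked with a small sketch of sine/cosine encoding. It shows the general cyclical-encoding idea, assuming months are numbered from 0; it is not necessarily the exact transformation DateWave applies, and the helper names are illustrative.

```python
import math


def cyclical(value, period):
    """Place a 0-based value on the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


january, june, december = (cyclical(month, 12) for month in (0, 5, 11))

# As raw integers, December (11) is far from January (0). On the circle,
# December is exactly one step away, the same as February would be.
print("Dec-Jan distance:", math.dist(december, january))
print("Jun-Jan distance:", math.dist(june, january))
```

The same construction applies to other cyclical components such as day of week (period 7) or hour of day (period 24).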
H3 Encoders
H3 encoders process Uber H3 geospatial hex indices.
| Encoder | Best For | Key Feature |
|---|---|---|
| H3Embed | Default geospatial encoding | Embeds H3 components (mode, edge, resolution, cells) separately |
| H3WeightedSum | Aggregated representation | Weighted sum of cell embeddings |
| H3RNN | Sequential cell patterns | Processes H3 cell hierarchy with an RNN |
- H3Embed is the simplest and most common choice. It embeds each component of the H3 index independently.
- H3WeightedSum learns weights over cell embeddings, useful when spatial aggregation matters.
- H3RNN treats the hierarchical cell resolution levels as a sequence, capturing multi-scale spatial patterns.
General Recommendations
- Start with defaults. Ludwig's default encoders are reasonable for most tasks. Only change them when you have a specific reason.
- Use pretrained encoders when possible. For text, image, and audio features, pretrained encoders usually outperform training from scratch, especially on smaller datasets.
- Match encoder to data size. Larger encoders (e.g., deberta, swin_transformer) need more data to fine-tune effectively. On small datasets, use smaller encoders or freeze pretrained weights with `trainable: false`.
- Consider latency. For production deployment, encoder inference time matters. Compact models (distilbert, efficientnet) offer good accuracy-speed trade-offs.
- Use auto_transformer / timm as escape hatches. If none of the named encoders fit your needs, these meta-encoders let you load any compatible model from the Hugging Face or TIMM ecosystem.