Pretrained image encoders
Pretrained Image Encoders¶
Ludwig supports three HuggingFace-backed pretrained image encoders alongside the traditional stacked_cnn approach:
| Encoder type | Pretrained source | Default model |
|---|---|---|
stacked_cnn |
None (random init) | — |
dinov2 |
Meta DINO v2 (self-supervised) | facebook/dinov2-base |
clip |
OpenAI CLIP (image-text contrastive) | openai/clip-vit-base-patch32 |
siglip |
Google SigLIP (sigmoid image-text) | google/siglip-base-patch16-224 |
All three pretrained encoders load weights from HuggingFace Hub by default and support both linear probing (frozen backbone) and full fine-tuning (trainable backbone).
When to use pretrained vs from-scratch¶
Use a pretrained encoder when: - Your dataset has fewer than ~10,000 labeled images - Your images are natural photographs (similar domain to pretraining data) - You want fast iteration and low GPU memory usage (linear probe) - You need good accuracy on very few examples (5–50 per class)
Use stacked_cnn (from scratch) when: - You have a large dataset and need maximum architectural flexibility - Your images are highly domain-specific (medical, satellite, microscopy) where pretrained features may not transfer well - You want full control over the convolutional architecture
Encoders¶
Convolutional Stack Encoder (stacked_cnn)¶
Stack of 2D convolutional layers, no pretrained weights. Suitable as a baseline or when training data is abundant.
input_features:
- name: image_path
type: image
encoder:
type: stacked_cnn
use_pretrained: false
DINOv2 (dinov2)¶
Self-supervised vision transformer from Meta (Oquab et al., TMLR 2024), trained on 142M images without image-text pairs. Produces rich general-purpose features that transfer well across diverse image domains.
Linear probe (recommended starting point):
input_features:
- name: image_path
type: image
encoder:
type: dinov2
use_pretrained: true
trainable: false # freeze backbone, train head only
Full fine-tuning:
input_features:
- name: image_path
type: image
encoder:
type: dinov2
use_pretrained: true
trainable: true
Configuration parameters:
| Parameter | Default | Description |
|---|---|---|
use_pretrained |
true |
Load pretrained weights from HuggingFace Hub |
trainable |
true |
Whether encoder parameters are updated during training |
pretrained_model_name_or_path |
facebook/dinov2-base |
HuggingFace model identifier |
Available model variants (larger models produce higher-dimensional outputs and are more accurate but slower):
facebook/dinov2-small— 22M params, 384-dim outputfacebook/dinov2-base— 86M params, 768-dim output (default)facebook/dinov2-large— 307M params, 1024-dim outputfacebook/dinov2-giant— 1.1B params, 1536-dim output
CLIP (clip)¶
Vision transformer from OpenAI (Radford et al., ICML 2021), trained on 400M image-text pairs using contrastive learning. Produces embeddings aligned with text in a shared latent space.
Best for: image-text retrieval, zero-shot classification, multimodal fusion tasks.
input_features:
- name: image_path
type: image
encoder:
type: clip
use_pretrained: true
trainable: false
Configuration parameters:
| Parameter | Default | Description |
|---|---|---|
use_pretrained |
true |
Load pretrained weights from HuggingFace Hub |
trainable |
true |
Whether encoder parameters are updated during training |
pretrained_model_name_or_path |
openai/clip-vit-base-patch32 |
HuggingFace model identifier |
Available model variants:
openai/clip-vit-base-patch32— ViT-B/32, 768-dim output (default)openai/clip-vit-base-patch16— ViT-B/16, higher resolution patchesopenai/clip-vit-large-patch14— ViT-L/14, 1024-dim output
SigLIP (siglip)¶
Vision transformer from Google (Zhai et al., ICCV 2023), which improves on CLIP by replacing the softmax contrastive loss with a per-image sigmoid loss. This removes dependence on global batch statistics and enables better scaling to large batch sizes and model sizes.
Best for: similar to CLIP, but often outperforms it on downstream classification. Particularly strong at small model sizes.
input_features:
- name: image_path
type: image
encoder:
type: siglip
use_pretrained: true
trainable: false
Configuration parameters:
| Parameter | Default | Description |
|---|---|---|
use_pretrained |
true |
Load pretrained weights from HuggingFace Hub |
trainable |
true |
Whether encoder parameters are updated during training |
pretrained_model_name_or_path |
google/siglip-base-patch16-224 |
HuggingFace model identifier |
Available model variants:
google/siglip-base-patch16-224— ViT-B/16 at 224px, 768-dim output (default)google/siglip-large-patch16-256— ViT-L/16 at 256px, 1024-dim outputgoogle/siglip-so400m-patch14-384— SO400M at 384px, 1152-dim output (highest quality)
Linear probing vs fine-tuning¶
Both modes are enabled by the trainable parameter on the encoder config.
Linear probing (trainable: false)¶
The pretrained backbone is completely frozen. Only the Ludwig output head (a small fully connected layer mapping encoder output to class logits) has trainable parameters.
Advantages: - Very fast training (no backprop through the backbone) - Low GPU memory (no activation storage for the backbone) - Resistant to overfitting on small datasets - Good accuracy even with fewer than 50 labeled examples per class
Disadvantages: - Cannot adapt the backbone features to your specific domain - Performance ceiling lower than full fine-tuning with sufficient data
Recommended settings: higher learning rate (0.001–0.01), more epochs (10–20)
Full fine-tuning (trainable: true)¶
Gradients flow through the entire encoder. All parameters are updated.
Advantages: - Higher accuracy ceiling, especially on domain-specific images - Can adapt pretrained features to new domains
Disadvantages: - Requires more GPU memory (stores activations for all backbone layers) - Risk of catastrophic forgetting with too high a learning rate - Requires more labeled data to avoid overfitting
Recommended settings: lower learning rate (1e-4 to 5e-5), fewer epochs (3–10), use early stopping
Decision guide¶
Do you have fewer than 1,000 labeled images?
YES → Use linear probe (trainable: false)
NO → Try fine-tuning; fall back to linear probe if overfitting
Is your domain very different from natural photographs?
YES → Fine-tuning may be necessary to adapt features
NO → Linear probe likely sufficient
Do you have limited GPU memory (<8 GB)?
YES → Use linear probe
NO → Either mode works
Performance expectations¶
The following numbers are approximate, based on the beans dataset (~1,000 training images, 3 classes) on a T4 GPU:
| Encoder | Mode | Accuracy | Train time | Peak GPU |
|---|---|---|---|---|
stacked_cnn |
from scratch | ~0.65–0.75 | ~5 min | ~1 GB |
dinov2 |
linear probe | ~0.90–0.95 | ~2 min | ~2 GB |
dinov2 |
fine-tuned | ~0.93–0.97 | ~5 min | ~6 GB |
clip |
linear probe | ~0.85–0.92 | ~2 min | ~2 GB |
siglip |
linear probe | ~0.87–0.93 | ~2 min | ~2 GB |
Note
Results vary significantly with dataset size, domain, and hyperparameters. Always run your own experiments.
Saved weights and checkpoints¶
All three pretrained encoders set saved_weights_in_checkpoint: false by default. When you save a trained Ludwig model and reload it, Ludwig automatically sets this to true to load encoder weights from the checkpoint rather than re-downloading from HuggingFace. This means trained models are fully self-contained.
Example notebook¶
See the Pretrained Image Encoders example notebook for a complete walkthrough comparing all four approaches on the beans plant disease dataset, including a few-shot experiment with only 15 training examples.