
Vision-Language Model Fine-Tuning

Ludwig supports fine-tuning Vision-Language Models (VLMs) — pretrained models that jointly reason over images and text. This lets you adapt models like Qwen2-VL, LLaVA, or Idefics to your own image+text tasks without writing any custom training code.

What is a VLM?

A Vision-Language Model combines a vision encoder (typically a ViT or CLIP-style model) with a large language model backbone. The model receives both image patches and text tokens as input and generates text as output. Common tasks include:

  • Visual Question Answering (VQA) — answer natural-language questions about images
  • Image Captioning — generate descriptions of images
  • Instruction following on images — follow complex instructions that reference visual content
  • Document understanding — parse charts, tables, or screenshots

Prerequisites

pip install "ludwig[llm]"
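A quick sanity check that the package installed and is visible to Python (standard library only, so it makes no assumptions about Ludwig internals):

import importlib.metadata

# Print the installed Ludwig version; raises PackageNotFoundError if the install failed.
print("ludwig", importlib.metadata.version("ludwig"))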

To use models from HuggingFace Hub that require gated access (like Llama-based VLMs):

huggingface-cli login
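If you prefer to authenticate from Python rather than the CLI, the huggingface_hub client exposes the same login; the token value below is a placeholder for your own access token:

from huggingface_hub import login

# Equivalent to huggingface-cli login; replace the placeholder with your token.
login(token="hf_...")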

Dataset format

VLM fine-tuning requires a CSV file with at least the following columns:

Column      Type    Description
----------  ------  -------------------------------------------
image_path  string  Absolute or relative path to an image file
question    string  Instruction or question about the image
answer      string  Expected response (the fine-tuning target)

For example, a minimal vqa_dataset.csv:

image_path,question,answer
/data/dog.jpg,What breed is this dog?,Golden Retriever
/data/chart.png,What is the highest value on the y-axis?,42
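If your examples live in Python objects rather than a spreadsheet, a small pandas sketch for writing this file (the rows simply reuse the example values above):

import pandas as pd

# Assemble the example rows shown above and write them out as vqa_dataset.csv.
rows = [
    {"image_path": "/data/dog.jpg", "question": "What breed is this dog?", "answer": "Golden Retriever"},
    {"image_path": "/data/chart.png", "question": "What is the highest value on the y-axis?", "answer": "42"},
]
pd.DataFrame(rows).to_csv("vqa_dataset.csv", index=False)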

Configuration

Minimal config (Qwen2-VL)

model_type: llm
base_model: Qwen/Qwen2-VL-7B-Instruct

# Enable multimodal (VLM) mode — required for image+text models
is_multimodal: true
trust_remote_code: true

input_features:
  - name: image_path
    type: image
  - name: question
    type: text

output_features:
  - name: answer
    type: text

adapter:
  type: lora
  r: 16
  alpha: 32
  target_modules: ["q_proj", "v_proj"]

trainer:
  type: finetune
  epochs: 3
  batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 2.0e-5
  learning_rate_scheduler:
    decay: cosine
    warmup_fraction: 0.03

# 4-bit quantization to fit a 7B VLM on a single 24 GB GPU
quantization:
  bits: 4
  bnb_4bit_quant_type: nf4
  bnb_4bit_compute_dtype: bfloat16

generation:
  max_new_tokens: 256
  temperature: 0.0

The key parameter is is_multimodal: true. When set, Ludwig does three things (a rough transformers-level sketch follows the list):

  1. Loads the model with AutoModelForVision2Seq instead of AutoModelForCausalLM
  2. Loads the corresponding multimodal processor (AutoProcessor) for joint text+image tokenization
  3. Handles image patch extraction and interleaving with text tokens automatically
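For intuition only, here is a rough, illustrative sketch of what those steps correspond to at the transformers level, using the LLaVA 1.5 checkpoint listed below; Ludwig performs the equivalent work for you, so you never write this yourself:

from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Illustrative only: load the multimodal processor and the Vision2Seq model directly.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# The processor combines image patches and text tokens in a single call.
image = Image.open("/data/dog.jpg")
prompt = "USER: <image>\nWhat breed is this dog? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=32)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])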

Supported base models

Any HuggingFace Vision2Seq model works as a base model:

Model         HuggingFace ID                     Notes
------------  ---------------------------------  -----------------------------------------------------
Qwen2-VL 7B   Qwen/Qwen2-VL-7B-Instruct          Strong performance; requires trust_remote_code: true
Qwen2-VL 72B  Qwen/Qwen2-VL-72B-Instruct         Best quality; needs 2× A100 with QLoRA
LLaVA 1.5     llava-hf/llava-1.5-7b-hf           Widely used baseline
LLaVA-NeXT    llava-hf/llava-v1.6-mistral-7b-hf  Improved resolution handling
Idefics-9B    HuggingFaceM4/idefics-9b-instruct  Multi-image support
InternVL2     OpenGVLab/InternVL2-8B             Strong on document understanding
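One lightweight way to sanity-check that a checkpoint exposes a Vision2Seq architecture before pointing Ludwig at it is to inspect its config, which downloads only a small JSON file rather than the weights:

from transformers import AutoConfig

# Print the architecture classes declared by each checkpoint's config.
for model_id in ["Qwen/Qwen2-VL-7B-Instruct", "llava-hf/llava-1.5-7b-hf"]:
    cfg = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    print(model_id, cfg.architectures)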

Training

ludwig train \
  --config vlm_config.yaml \
  --dataset vqa_dataset.csv \
  --output_directory ./results

Or via Python:

from ludwig.api import LudwigModel

model = LudwigModel(config="vlm_config.yaml")
train_stats, preprocessed_data, output_directory = model.train(
    dataset="vqa_dataset.csv",
    output_directory="./results",
    skip_save_processed_input=True,  # images are large; skip caching
)
print(f"Model saved to: {output_directory}")

Inference

After training, load the fine-tuned model and run predictions:

import pandas as pd
from ludwig.api import LudwigModel

model = LudwigModel.load("results/experiment_run/model")

test_data = pd.DataFrame([
    {"image_path": "/data/test1.jpg", "question": "What color is the car?"},
    {"image_path": "/data/test2.jpg", "question": "How many people are in this image?"},
])
predictions, _ = model.predict(dataset=test_data)
print(predictions["answer_predictions"])
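If your held-out set also contains the ground-truth answer column, you can score the fine-tuned model with Ludwig's evaluate API. A sketch, continuing from the model loaded above; the file name vqa_test.csv is a placeholder for your own test split:

# Runs prediction and computes metrics against the answer column.
eval_stats, predictions, output_dir = model.evaluate(dataset="vqa_test.csv")
print(eval_stats)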

Memory requirements

Setup                         GPU memory  Recommended GPU
----------------------------  ----------  ---------------------
Qwen2-VL-7B, QLoRA 4-bit      ~16 GB      RTX 4090, A100-40GB
Qwen2-VL-7B, full fine-tune   ~48 GB      2× A100-40GB
Qwen2-VL-72B, QLoRA 4-bit     ~80 GB      2× A100-80GB
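To see which row of the table applies to your machine, a quick check of local GPU memory with torch (installed as a Ludwig dependency):

import torch

# Report the total memory of each visible CUDA device.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA device visible")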

Distributed training on Ray

For large VLMs or large datasets, use Ray for multi-GPU training:

backend:
  type: ray
  trainer:
    use_gpu: true
    num_workers: 2
    resources_per_worker:
      GPU: 1

Then launch training as before:

ludwig train --config vlm_config.yaml --dataset vqa.csv
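Because the backend block lives in the same config as everything else, the Python API works unchanged. A sketch of attaching the Ray backend to the earlier config programmatically, assuming vlm_config.yaml is the file shown above:

import yaml
from ludwig.api import LudwigModel

# Load the existing config and attach the Ray backend block.
with open("vlm_config.yaml") as f:
    config = yaml.safe_load(f)

config["backend"] = {
    "type": "ray",
    "trainer": {
        "use_gpu": True,
        "num_workers": 2,
        "resources_per_worker": {"GPU": 1},
    },
}

model = LudwigModel(config=config)
model.train(dataset="vqa.csv", output_directory="./results")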

See also