VLM Fine-Tuning
This example fine-tunes Qwen2-VL-7B-Instruct on a Visual Question Answering (VQA) task: the model learns to answer natural-language questions about images. Training uses 4-bit QLoRA, so it fits on a single 24 GB GPU.
For the full user guide see VLM Fine-Tuning.
Dataset
The example uses a VQA dataset formatted as a CSV with three columns:
| Column | Description |
|---|---|
| image_path | Path to the image file |
| question | Natural-language question about the image |
| answer | Expected answer (the fine-tuning target) |
image_path,question,answer
/data/dog.jpg,What breed is this dog?,Golden Retriever
/data/chart.png,What is the peak value in the chart?,42
/data/receipt.jpg,What is the total amount?,$18.50
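A malformed row or a missing image file will only surface as a preprocessing error at train time, so it can be worth sanity-checking the CSV first. A minimal sketch with pandas, assuming the CSV is saved as vqa_dataset.csv as in the training commands below:

import pandas as pd
from pathlib import Path

df = pd.read_csv("vqa_dataset.csv")

# Every image_path must point at an existing file.
missing = [p for p in df["image_path"] if not Path(p).exists()]
if missing:
    raise FileNotFoundError(f"{len(missing)} missing image(s), e.g. {missing[:3]}")

# Questions and answers should be non-empty.
assert df[["question", "answer"]].notna().all().all(), "empty question/answer cells"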
Using a HuggingFace VQA dataset
You can also use datasets from HuggingFace:
from datasets import load_dataset
import pandas as pd
from pathlib import Path
# Load VQA-v2 (small split for demo)
ds = load_dataset("HuggingFaceM4/VQAv2", split="validation[:2000]", trust_remote_code=True)
# Save images locally and build CSV
img_dir = Path("vqa_images")
img_dir.mkdir(exist_ok=True)
rows = []
for i, item in enumerate(ds):
    img_path = img_dir / f"{i}.jpg"
    item["image"].save(img_path)
    rows.append({
        "image_path": str(img_path),
        "question": item["question"],
        "answer": item["multiple_choice_answer"],
    })
df = pd.DataFrame(rows)
df.to_csv("vqa_dataset.csv", index=False)
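Ludwig can split data automatically, but holding out an explicit evaluation set makes later comparisons easier. A small sketch; the vqa_train.csv and vqa_eval.csv names are just for this example:

import pandas as pd

df = pd.read_csv("vqa_dataset.csv")

# Hold out 10% of rows for evaluation; the fixed seed keeps the split reproducible.
eval_df = df.sample(frac=0.1, random_state=42)
train_df = df.drop(eval_df.index)

train_df.to_csv("vqa_train.csv", index=False)
eval_df.to_csv("vqa_eval.csv", index=False)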
Configuration
# vlm_config.yaml
model_type: llm
base_model: Qwen/Qwen2-VL-7B-Instruct
is_multimodal: true
trust_remote_code: true
input_features:
  - name: image_path
    type: image
  - name: question
    type: text
output_features:
  - name: answer
    type: text
adapter:
  type: lora
  r: 16
  alpha: 32
  target_modules: ["q_proj", "v_proj"]
trainer:
  type: finetune
  epochs: 3
  batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 2.0e-5
  learning_rate_scheduler:
    decay: cosine
    warmup_fraction: 0.03
quantization:
  bits: 4
  quantization_type: nf4
  compute_dtype: bfloat16
generation:
  max_new_tokens: 256
  temperature: 0.0
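For readers who want to see what the adapter and quantization sections correspond to, here is roughly the equivalent setup in raw HuggingFace terms. This is only an illustrative sketch of what those settings mean, not what Ludwig constructs internally:

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# quantization section: 4-bit NF4 weights with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# adapter section: LoRA on the attention query/value projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
)

Note also that with batch_size: 4 and gradient_accumulation_steps: 8, each optimizer step sees an effective batch of 32 examples.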
Training
pip install "ludwig[llm]"
huggingface-cli login # needed for gated models
ludwig train \
  --config vlm_config.yaml \
  --dataset vqa_dataset.csv \
  --output_directory ./results
Or with the Python API:
from ludwig.api import LudwigModel
model = LudwigModel(config="vlm_config.yaml")
# train() returns a tuple, not a results object
train_stats, preprocessed_data, output_directory = model.train(
    dataset="vqa_dataset.csv",
    output_directory="./results",
    skip_save_processed_input=True,
)
print(f"Model saved to: {output_directory}")
Inference
import pandas as pd
from ludwig.api import LudwigModel
model = LudwigModel.load("results/experiment_run/model")
questions = pd.DataFrame([
    {"image_path": "test1.jpg", "question": "What color is the car?"},
    {"image_path": "test2.jpg", "question": "How many people are visible?"},
])
predictions, _ = model.predict(dataset=questions)
for q, a in zip(questions["question"], predictions["answer_predictions"]):
    print(f"Q: {q}")
    print(f"A: {a}\n")
Multi-GPU training
For larger models or datasets, scale to multiple GPUs with Ray:
# vlm_config_distributed.yaml (append to the config above)
backend:
  type: ray
  trainer:
    use_gpu: true
    num_workers: 4
    resources_per_worker:
      GPU: 1
ray start --head
ludwig train --config vlm_config_distributed.yaml --dataset vqa_dataset.csv
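If the Ray cluster spans several machines, each worker node must join the head node before training is launched. A sketch using the standard Ray CLI; the address is a placeholder for your head node's IP:

# run on each additional worker node (6379 is Ray's default port)
ray start --address=<head-node-ip>:6379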
Alternative VLM base models
Swap base_model for any HuggingFace Vision2Seq model:
# LLaVA 1.5
base_model: llava-hf/llava-1.5-7b-hf
is_multimodal: true
trust_remote_code: false # LLaVA doesn't need this

# InternVL2 (strong on document tasks)
base_model: OpenGVLab/InternVL2-8B
is_multimodal: true
trust_remote_code: true
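Because the Ludwig config is just a dict, you can also swap base models programmatically instead of editing the YAML. A sketch, assuming the vlm_config.yaml shown above is on disk:

import yaml
from ludwig.api import LudwigModel

with open("vlm_config.yaml") as f:
    config = yaml.safe_load(f)

# Try LLaVA 1.5 instead of Qwen2-VL, keeping the rest of the config unchanged.
config["base_model"] = "llava-hf/llava-1.5-7b-hf"
config["trust_remote_code"] = False

model = LudwigModel(config=config)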
See also
- VLM Fine-Tuning user guide — full reference
- LLM Fine-Tuning — text-only LLM fine-tuning
- Multi-adapter PEFT — merge specialized VLM adapters