Question Answering

This example shows how to build a question answering model with Ludwig — given a passage of text and a question, the model generates an answer span or free-form text response.

What is Question Answering?

Question answering (QA) is the task of automatically generating an answer to a natural language question, typically conditioned on a supporting passage or knowledge source. There are two main flavours:

  • Extractive QA: the model selects a span from the passage as its answer.
  • Generative QA / open-domain QA: the model generates the answer text freely, drawing on parametric knowledge or retrieved passages.

Ludwig treats both as text-generation problems — the output is a text feature decoded with a generator.

Datasets

Ludwig ships with several QA datasets that can be loaded with one command:

| Dataset | Description | Train size |
| --- | --- | --- |
| drop | Discrete Reasoning Over Paragraphs: reading comprehension requiring arithmetic | 77K |
| ambig_qa | AmbigQA: naturally ambiguous open-domain questions with multiple valid answers | 14K |
| nq_open | Natural Questions Open: open-domain QA with Wikipedia answers | 88K |
| boolq | BoolQ: naturally occurring yes/no questions with a supporting passage | 9K |
| arc_challenge | ARC Challenge: science exam questions requiring reasoning | 1.1K |
| arc_easy | ARC Easy: science exam questions, easier split | 2.3K |
| cmrc2018 | CMRC 2018: Chinese machine reading comprehension | 10K |
| aqua_rat | AQuA-RAT: algebraic word problems with rationales | 97K |

This tutorial uses the DROP dataset, which requires multi-step reasoning to answer questions about a passage. A sample from DROP looks like this:

| passage | question | answers_spans |
| --- | --- | --- |
| In the 1950s, rock and roll was born... | How many decades after the birth of rock and roll was disco popular? | 2 |
| The Broncos scored 14 points in Q1 and 7 in Q2... | How many points did the Broncos score in the first half? | 21 |
| ... | ... | ... |

Download the Dataset

ludwig datasets download drop

This writes drop.csv to the current directory.

Equivalently, load the dataset splits directly with the Python API:

from ludwig.datasets import drop

train_df, val_df, test_df = drop.load(split=True)
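
A quick sanity check on the loaded splits, using the column names shown in the sample above:

print(len(train_df), len(val_df), len(test_df))
print(train_df[["passage", "question", "answers_spans"]].head(3))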

Define the Ludwig Config

The following config fine-tunes pre-trained BERT encoders on the passage and the question separately, concatenates the two representations, and feeds the result to a text generator decoder that produces the answer.

# config.yaml
input_features:
  - name: passage
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: google-bert/bert-base-uncased
      trainable: true
      max_sequence_length: 384
  - name: question
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: google-bert/bert-base-uncased
      trainable: true
      max_sequence_length: 128

output_features:
  - name: answers_spans
    type: text
    decoder:
      type: generator
      max_new_tokens: 32

combiner:
  type: concat

trainer:
  epochs: 5
  batch_size: 16
  learning_rate: 2.0e-5
  learning_rate_scheduler:
    warmup_fraction: 0.06
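
Conceptually, the concat combiner pools each encoder's output into a fixed-size vector and concatenates the vectors before decoding. A minimal PyTorch sketch of the idea (not Ludwig's actual implementation):

import torch

batch_size, hidden = 16, 768                     # BERT-base hidden size
passage_repr = torch.randn(batch_size, hidden)   # pooled passage encoding
question_repr = torch.randn(batch_size, hidden)  # pooled question encoding

# concat joins the per-feature vectors along the last dimension, giving a
# (16, 1536) tensor for the generator decoder to condition on
combined = torch.cat([passage_repr, question_repr], dim=-1)
print(combined.shape)  # torch.Size([16, 1536])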

For open-domain QA datasets like nq_open, where no supporting passage is provided, use the question as the sole input:

# config_open_domain.yaml
input_features:
  - name: question
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: google/flan-t5-base
      trainable: true

output_features:
  - name: answer
    type: text
    decoder:
      type: generator
      max_new_tokens: 64

trainer:
  epochs: 5
  batch_size: 16
  learning_rate: 3.0e-5
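
Training on nq_open then follows the same pattern as the DROP run below; a sketch, assuming the nq_open dataset module exposes the same load API as drop:

from ludwig.api import LudwigModel
from ludwig.datasets import nq_open

model = LudwigModel("config_open_domain.yaml")
train_df, val_df, test_df = nq_open.load(split=True)
model.train(training_set=train_df, validation_set=val_df, test_set=test_df)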

For yes/no QA datasets like boolq, the answer is a binary output feature, which is simpler and faster to train than text generation:

# config_boolq.yaml
input_features:
  - name: passage
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: google-bert/bert-base-uncased
      trainable: true
  - name: question
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: google-bert/bert-base-uncased
      trainable: true

output_features:
  - name: answer
    type: binary

combiner:
  type: concat

trainer:
  epochs: 5
  learning_rate: 2.0e-5
  batch_size: 32
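
Because the output is binary, evaluation reports classification metrics such as accuracy and ROC-AUC instead of text-generation metrics. A sketch, assuming a model trained with this config on boolq and that eval_stats is keyed by output feature name:

# `model` trained on boolq, `test_df` its test split
eval_stats, _, _ = model.evaluate(dataset=test_df)
print(eval_stats["answer"]["accuracy"])  # assumption: metrics keyed by feature name
print(eval_stats["answer"]["roc_auc"])   # assumption: binary outputs report ROC-AUC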

Train

ludwig train --config config.yaml --dataset "ludwig://drop"

Or with the Python API:

from ludwig.api import LudwigModel
from ludwig.datasets import drop

model = LudwigModel("config.yaml")  # also accepts a config dict

train_df, val_df, test_df = drop.load(split=True)
results = model.train(
    training_set=train_df,
    validation_set=val_df,
    test_set=test_df,
)
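
model.train returns a tuple whose last element is the output directory used by the CLI commands below (a sketch of the return shape; exact contents vary by Ludwig version):

train_stats, preprocessed_data, output_dir = results
print(output_dir)  # e.g. results/api_experiment_run, containing the saved model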

Evaluate

ludwig evaluate \
  --model_path results/experiment_run/model \
  --dataset "ludwig://drop" \
  --split test \
  --output_directory eval_results
Or programmatically, reusing the trained model and test split from above:

eval_stats, predictions, _ = model.evaluate(
    dataset=test_df,
    collect_predictions=True,
)
print(eval_stats)
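
With collect_predictions=True you can compute a quick exact-match score against the references. This is a rough sketch: real DROP scoring normalizes answers and allows multiple valid spans, and depending on the Ludwig version text predictions may be token lists that need joining first.

preds = predictions["answers_spans_predictions"].astype(str).str.strip().str.lower()
refs = test_df["answers_spans"].astype(str).str.strip().str.lower()
print("exact match:", (preds.values == refs.values).mean())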

Predict on New Examples

ludwig predict \
  --model_path results/experiment_run/model \
  --dataset my_questions.csv
Or in Python:

import pandas as pd

new_questions = pd.DataFrame({
    "passage": [
        "The Eiffel Tower is located in Paris, France. It was built in 1889."
    ],
    "question": [
        "In what year was the Eiffel Tower built?"
    ],
})

predictions, _ = model.predict(dataset=new_questions)
print(predictions["answers_spans_predictions"])

Tips

Choosing the right encoder

| Encoder | Best for |
| --- | --- |
| google-bert/bert-base-uncased | Short passages, extractive QA |
| deepset/roberta-base-squad2 | Pre-trained on SQuAD; a strong starting point for extractive QA |
| google/flan-t5-base | Open-domain generative QA |
| facebook/bart-base | Generative answers requiring paraphrasing |

Long passage handling

Many QA datasets contain passages longer than a single encoder window. Cap the passage encoder's maximum sequence length:

- name: passage
  type: text
  encoder:
    type: auto_transformer
    pretrained_model_name_or_path: google-bert/bert-base-uncased
    max_sequence_length: 512   # adjust to GPU memory
    trainable: true

Or use a sliding-window encoder like longformer for very long documents:

- name: passage
  type: text
  encoder:
    type: auto_transformer
    pretrained_model_name_or_path: allenai/longformer-base-4096
    trainable: true
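
To pick a sensible cap, inspect the token-length distribution of your passages first; a sketch using the Hugging Face tokenizer for the chosen checkpoint:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
lengths = sorted(len(tok.encode(p)) for p in train_df["passage"].head(1000))
print("p95 passage length:", lengths[int(0.95 * len(lengths))])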

Controlling answer length

Tune max_new_tokens in the generator decoder to cap how long generated answers can be:

output_features:
  - name: answers_spans
    type: text
    decoder:
      type: generator
      max_new_tokens: 32     # short factoid answers
      # max_new_tokens: 256  # longer explanations

Using an LLM for zero-shot QA

For zero-shot QA with no training at all, use Ludwig's LLM backend:

model_type: llm
base_model: meta-llama/Llama-3.1-8B

quantization:
  bits: 4

prompt:
  template: |
    Answer the following question based on the passage.

    Passage: {passage}

    Question: {question}

    Answer:

input_features:
  - name: prompt
    type: text

output_features:
  - name: answers_spans
    type: text

trainer:
  type: none   # zero-shot — no fine-tuning
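
With trainer type none there is no training step: instantiate the model from the config and call predict directly. A sketch, assuming the config above is saved as config_llm.yaml (hypothetical filename) and that you have GPU access plus Hugging Face credentials for the Llama weights:

from ludwig.api import LudwigModel

model = LudwigModel("config_llm.yaml")  # hypothetical filename for the config above
predictions, _ = model.predict(dataset=test_df)
print(predictions["answers_spans_predictions"].head())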
Dataset Reference

For reference, the full set of QA datasets bundled with Ludwig:

| Dataset | Task | Ludwig name |
| --- | --- | --- |
| DROP | Discrete reasoning reading comprehension | drop |
| AmbigQA | Open-domain QA with ambiguous questions | ambig_qa |
| Natural Questions Open | Open-domain Wikipedia QA | nq_open |
| BoolQ | Yes/no QA | boolq |
| BoolQ Standalone | Yes/no QA without passage | boolq_standalone |
| ARC Challenge | Science exam QA | arc_challenge |
| ARC Easy | Science exam QA (easy split) | arc_easy |
| CMRC 2018 | Chinese machine reading comprehension | cmrc2018 |
| AQuA-RAT | Math word problems with rationales | aqua_rat |
| BigBench | Diverse reasoning tasks | bigbench |

See Also