Multilingual NLP
This example shows how to train a single Ludwig model on multilingual data — text from dozens of languages — using multilingual pre-trained encoders.
Why Multilingual NLP?¶
Many real-world applications must handle user input from multiple languages simultaneously. A multilingual model:
- Eliminates the need to train and maintain a separate model per language
- Benefits from cross-lingual transfer — training signal from high-resource languages helps lower-resource ones
- Scales easily as new languages are added to a dataset
Ludwig supports multilingual NLP out of the box through its auto_transformer encoder, which can load any HuggingFace model including multilingual ones like bert-base-multilingual-cased and xlm-roberta-base.
Datasets¶
Ludwig ships with multilingual NLP datasets spanning dozens of languages:
| Dataset | Languages | Task | Size |
|---|---|---|---|
| `amazon_massive_intent` | 51 | Intent classification (60 classes) | 587K train |
| `amazon_massive_scenario` | 51 | Scenario classification (18 classes) | 587K train |
| `amazon_science_massive` | 51 | Intent classification (60 classes) | 587K train |
| `belebele` | 122 | Reading comprehension (multiple choice) | 900 per language |
| `belebele_fr` | 1 (French) | Reading comprehension | 900 test |
| `wikiann_en` | English | NER (IOB2) | 20K train |
| `wikiann_de` | German | NER (IOB2) | 20K train |
| `multinerd` | 10 | Fine-grained NER (31 entity types) | 134K train |
| `mls_german` | German | Speech recognition (ASR) | 469K train |
| `bornholm_bitext` | Danish/Bornholmish | Translation | 19K pairs |
| `europarl_bg_cs` | Bulgarian/Czech | Translation | 400K pairs |
| `clue_afqmc` | Chinese | Sentence similarity | 34K train |
| `cmrc2018` | Chinese | Reading comprehension | 10K train |
| `alpaca_gpt4_zh` | Chinese | Instruction tuning | 42K train |
This tutorial uses the Amazon MASSIVE multilingual intent classification dataset, which spans 51 languages and 60 intent classes — one of the most comprehensive multilingual NLP benchmarks available.
A sample of the dataset:
| locale | utt | intent |
|---|---|---|
| en-US | set an alarm for seven am | alarm_set |
| de-DE | stelle einen alarm für sieben uhr ein | alarm_set |
| zh-CN | 设置七点的闹钟 | alarm_set |
| ar-SA | اضبط منبهاً للساعة السابعة صباحاً | alarm_set |
| ... | ... | ... |
Download the Dataset¶
The following command downloads the dataset and writes `amazon_massive_intent.csv` to the current directory:

```bash
ludwig datasets download amazon_massive_intent
```

Alternatively, load it directly in Python:

```python
from ludwig.datasets import amazon_massive_intent

train_df, val_df, test_df = amazon_massive_intent.load(split=True)
print(f"Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")
print(train_df.head())
```
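As a quick sanity check after loading, you can confirm that all 51 locales are present and evenly represented:

```python
# MASSIVE provides the same utterances translated into every locale,
# so the per-locale counts should be (nearly) identical
print(train_df["locale"].nunique(), "locales")
print(train_df["locale"].value_counts().head())
```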
Define the Ludwig Config¶
The key to good multilingual performance is using a multilingual pre-trained encoder. Two strong options:
- `bert-base-multilingual-cased`: covers 104 languages, good all-around baseline
- `xlm-roberta-base`: stronger than mBERT, covers 100 languages, generally better for low-resource languages
```yaml
# config.yaml
input_features:
  - name: utt
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: google-bert/bert-base-multilingual-cased
      trainable: true
      max_sequence_length: 128

output_features:
  - name: intent
    type: category

trainer:
  epochs: 10
  learning_rate: 2.0e-5
  batch_size: 64
  learning_rate_scheduler:
    warmup_fraction: 0.06
```
```yaml
# config_xlmr.yaml
input_features:
  - name: utt
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: FacebookAI/xlm-roberta-base
      trainable: true
      max_sequence_length: 128

output_features:
  - name: intent
    type: category

trainer:
  epochs: 10
  learning_rate: 2.0e-5
  batch_size: 64
  learning_rate_scheduler:
    warmup_fraction: 0.06
```
A third option is to train only on English and evaluate on all languages, which tests zero-shot cross-lingual transfer:
```yaml
# config_zeroshot.yaml
# First filter amazon_massive_intent to the en-US locale, then train:
input_features:
  - name: utt
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: FacebookAI/xlm-roberta-base
      trainable: true

output_features:
  - name: intent
    type: category

trainer:
  epochs: 15
  learning_rate: 2.0e-5
  batch_size: 32
```
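The filtering step mentioned in the comment above can be done with pandas before training; a minimal sketch (the `massive_en_train.csv` filename is just an example):

```python
from ludwig.datasets import amazon_massive_intent

# Keep only the English locale for zero-shot training
train_df, val_df, test_df = amazon_massive_intent.load(split=True)
en_train = train_df[train_df["locale"] == "en-US"]
en_train.to_csv("massive_en_train.csv", index=False)  # example filename
```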
Train¶
```bash
ludwig train --config config.yaml --dataset "ludwig://amazon_massive_intent"
```
```python
import yaml

from ludwig.api import LudwigModel
from ludwig.datasets import amazon_massive_intent

with open("config.yaml") as f:
    config = yaml.safe_load(f)

model = LudwigModel(config)
train_df, val_df, test_df = amazon_massive_intent.load(split=True)

results = model.train(
    training_set=train_df,
    validation_set=val_df,
    test_set=test_df,
)
```
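In Ludwig's Python API, `train()` returns a `(train_stats, preprocessed_data, output_directory)` triple; unpacking it gives the path where the model was saved:

```python
# train() returns training statistics, the preprocessed data, and the
# directory where the model and its artifacts were written
train_stats, _, output_directory = results
print("Model saved under:", output_directory)
```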
Evaluate¶
Overall evaluation across all 51 languages:
```bash
ludwig evaluate \
    --model_path results/experiment_run/model \
    --dataset "ludwig://amazon_massive_intent" \
    --split test \
    --output_directory eval_results
```
Evaluate per language by filtering the test set:
```python
import pandas as pd

train_df, val_df, test_df = amazon_massive_intent.load(split=True)

# Overall
eval_stats, predictions, _ = model.evaluate(test_df, collect_predictions=True)
print("Overall accuracy:", eval_stats["intent"]["accuracy"])

# Per language
for locale in test_df["locale"].unique():
    subset = test_df[test_df["locale"] == locale]
    stats, _, _ = model.evaluate(subset)
    print(f"{locale}: {stats['intent']['accuracy']:.3f}")
```
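To make weak languages easier to spot, you can collect the per-locale scores into a DataFrame and sort them; a sketch building on the loop above:

```python
# Gather per-locale accuracy, sorted ascending so the weakest come first
rows = []
for locale in test_df["locale"].unique():
    subset = test_df[test_df["locale"] == locale]
    stats, _, _ = model.evaluate(subset)
    rows.append({"locale": locale, "accuracy": stats["intent"]["accuracy"]})

report = pd.DataFrame(rows).sort_values("accuracy")
print(report.head(10))  # the ten weakest locales
```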
Predict on New Text¶
```bash
ludwig predict \
    --model_path results/experiment_run/model \
    --dataset my_utterances.csv
```
```python
import pandas as pd

new_data = pd.DataFrame({
    "utt": [
        "set an alarm for 8am",          # English
        "stell einen wecker für 8 uhr",  # German
        "设置八点的闹钟",                  # Chinese
        "fija una alarma para las 8",    # Spanish
    ]
})

predictions, _ = model.predict(new_data)
print(predictions[["intent_predictions", "intent_probabilities"]])
```

Note that the prediction columns are named after the output feature (`intent`), not the input.
Tips¶
Choosing a multilingual encoder¶
| Model | Languages | Notes |
|---|---|---|
| `google-bert/bert-base-multilingual-cased` | 104 | Fast, well-tested baseline |
| `FacebookAI/xlm-roberta-base` | 100 | Generally stronger, especially low-resource |
| `FacebookAI/xlm-roberta-large` | 100 | Larger, better quality, slower |
| `microsoft/mdeberta-v3-base` | 100 | Strong multilingual model based on DeBERTa |
| `intfloat/multilingual-e5-base` | 100 | Optimised for retrieval/similarity tasks |
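Because the encoder is just a config field, switching models requires no code changes. For example, starting from the `config` dict loaded earlier:

```python
import copy

from ludwig.api import LudwigModel

# Swap the pretrained encoder for XLM-R large; everything else stays the same
xlmr_config = copy.deepcopy(config)
encoder = xlmr_config["input_features"][0]["encoder"]
encoder["pretrained_model_name_or_path"] = "FacebookAI/xlm-roberta-large"

model = LudwigModel(xlmr_config)
```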
Language-balanced training¶
The MASSIVE dataset contains the same number of examples per language, so the model sees equal training signal for every locale. If your dataset has imbalanced language counts, consider upsampling under-represented languages:
```python
import pandas as pd

# Oversample minority locales to 10K examples each
target_count = 10_000
balanced_dfs = []
for locale, group in train_df.groupby("locale"):
    if len(group) < target_count:
        group = group.sample(target_count, replace=True, random_state=42)
    balanced_dfs.append(group)

# Concatenate and shuffle so locales are interleaved
balanced_train = pd.concat(balanced_dfs).sample(frac=1, random_state=42)
```
Using language as an extra feature¶
If your dataset has a language/locale column, you can provide it as an additional categorical input; note that the same column must then be supplied at prediction time:
```yaml
input_features:
  - name: utt
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: FacebookAI/xlm-roberta-base
      trainable: true
  - name: locale
    type: category

combiner:
  type: concat

output_features:
  - name: intent
    type: category
```
Named Entity Recognition across languages¶
For multilingual NER, use the multinerd or wikiann_* datasets with a sequence output feature:
```yaml
# config_multilingual_ner.yaml
input_features:
  - name: sentence
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: FacebookAI/xlm-roberta-base
      trainable: true

output_features:
  - name: ner_tags
    type: sequence

trainer:
  epochs: 10
  learning_rate: 2.0e-5
  batch_size: 32
```
Train with:
```bash
ludwig train --config config_multilingual_ner.yaml --dataset "ludwig://multinerd"
```
Cross-lingual transfer experiment¶
A common research pattern is to train on one high-resource language and evaluate zero-shot on others:
```python
# Train on English only (assumes `model`, train_df, val_df, and test_df
# from the earlier sections)
en_train = train_df[train_df["locale"] == "en-US"]
en_val = val_df[val_df["locale"] == "en-US"]
model.train(training_set=en_train, validation_set=en_val)

# Evaluate zero-shot on every language
for locale in test_df["locale"].unique():
    subset = test_df[test_df["locale"] == locale]
    stats, _, _ = model.evaluate(subset)
    print(f"{locale}: {stats['intent']['accuracy']:.3f}")
```
Related Ludwig Datasets¶
| Dataset | Languages | Task | Ludwig name |
|---|---|---|---|
| Amazon MASSIVE (all languages) | 51 | Intent classification | amazon_massive_intent |
| Amazon MASSIVE (scenario) | 51 | Scenario classification | amazon_massive_scenario |
| Amazon MASSIVE (Science split) | 51 | Intent classification | amazon_science_massive |
| Belebele | 122 | Reading comprehension | belebele |
| MultiNERD | 10 | Fine-grained NER | multinerd |
| WikiANN English | English | NER | wikiann_en |
| WikiANN German | German | NER | wikiann_de |
| CLUE AFQMC | Chinese | Sentence similarity | clue_afqmc |
| CMRC 2018 | Chinese | Reading comprehension | cmrc2018 |
| Alpaca GPT-4 Chinese | Chinese | Instruction tuning | alpaca_gpt4_zh |
| Bornholm Bitext | Danish/Bornholmish | Translation | bornholm_bitext |
| Europarl Bulgarian-Czech | Bulgarian/Czech | Translation | europarl_bg_cs |
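Each of these loads through the same `ludwig.datasets` API used above (and, in recent Ludwig versions, `ludwig datasets list` prints every available dataset name). For example, a sketch assuming the default split behaviour:

```python
# Load the full MultiNERD dataframe without splitting
from ludwig.datasets import multinerd

df = multinerd.load(split=False)
print(len(df), "rows")
```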
See Also¶
- Text Classification — single-language text classification
- Named Entity Recognition — sequence tagging
- Machine Translation — translating between languages
- Natural Language Understanding — intent and slot classification