Multilingual NLP

This example shows how to train a single Ludwig model on multilingual data — text from dozens of languages — using multilingual pre-trained encoders.

Why Multilingual NLP?

Many real-world applications must handle user input from multiple languages simultaneously. A multilingual model:

  • Eliminates the need to train and maintain a separate model per language
  • Benefits from cross-lingual transfer — training signal from high-resource languages helps lower-resource ones
  • Scales easily as new languages are added to a dataset

Ludwig supports multilingual NLP out of the box through its auto_transformer encoder, which can load any HuggingFace model including multilingual ones like bert-base-multilingual-cased and xlm-roberta-base.
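
For example, swapping in a different checkpoint is a one-field change. A minimal sketch using the Python API (distilbert/distilbert-base-multilingual-cased is an arbitrary illustration here, not a recommendation from this guide):

from ludwig.api import LudwigModel

# Any HuggingFace model id can go in pretrained_model_name_or_path.
config = {
    "input_features": [
        {
            "name": "utt",
            "type": "text",
            "encoder": {
                "type": "auto_transformer",
                "pretrained_model_name_or_path": "distilbert/distilbert-base-multilingual-cased",
            },
        }
    ],
    "output_features": [{"name": "intent", "type": "category"}],
}
model = LudwigModel(config)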

Datasets

Ludwig ships with multilingual NLP datasets spanning dozens of languages:

| Dataset | Languages | Task | Size |
|---|---|---|---|
| amazon_massive_intent | 51 | Intent classification (60 classes) | 587K train |
| amazon_massive_scenario | 51 | Scenario classification (18 classes) | 587K train |
| amazon_science_massive | 51 | Intent classification (60 classes) | 587K train |
| belebele | 122 | Reading comprehension (multiple choice) | 900 per language |
| belebele_fr | 1 (French) | Reading comprehension | 900 test |
| wikiann_en | English | NER (IOB2) | 20K train |
| wikiann_de | German | NER (IOB2) | 20K train |
| multinerd | 10 | Fine-grained NER (31 entity types) | 134K train |
| mls_german | German | Speech recognition (ASR) | 469K train |
| bornholm_bitext | Danish/Bornholmish | Translation | 19K pairs |
| europarl_bg_cs | Bulgarian/Czech | Translation | 400K pairs |
| clue_afqmc | Chinese | Sentence similarity | 34K train |
| cmrc2018 | Chinese | Reading comprehension | 10K train |
| alpaca_gpt4_zh | Chinese | Instruction tuning | 42K train |

This tutorial uses the Amazon MASSIVE multilingual intent classification dataset, which spans 51 languages and 60 intent classes — one of the most comprehensive multilingual NLP benchmarks available.

A sample of the dataset:

| locale | utt | intent |
|---|---|---|
| en-US | set an alarm for seven am | alarm_set |
| de-DE | stelle einen alarm für sieben uhr ein | alarm_set |
| zh-CN | 设置七点的闹钟 | alarm_set |
| ar-SA | اضبط منبهاً للساعة السابعة صباحاً | alarm_set |
| ... | ... | ... |

Download the Dataset

The following command downloads the dataset and writes amazon_massive_intent.csv to the current directory:

ludwig datasets download amazon_massive_intent

Alternatively, load it directly with the Python API:

from ludwig.datasets import amazon_massive_intent

train_df, val_df, test_df = amazon_massive_intent.load(split=True)
print(f"Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")
print(train_df.head())

Define the Ludwig Config

The key to good multilingual performance is using a multilingual pre-trained encoder. Two strong options:

  • bert-base-multilingual-cased: Covers 104 languages, good all-around baseline
  • xlm-roberta-base: Stronger than mBERT, covers 100 languages, generally better for low-resource languages

With mBERT:

# config.yaml
input_features:
  - name: utt
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: google-bert/bert-base-multilingual-cased
      trainable: true
      max_sequence_length: 128

output_features:
  - name: intent
    type: category

trainer:
  epochs: 10
  learning_rate: 2.0e-5
  batch_size: 64
  learning_rate_scheduler:
    warmup_fraction: 0.06

The XLM-RoBERTa variant:

# config_xlmr.yaml
input_features:
  - name: utt
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: FacebookAI/xlm-roberta-base
      trainable: true
      max_sequence_length: 128

output_features:
  - name: intent
    type: category

trainer:
  epochs: 10
  learning_rate: 2.0e-5
  batch_size: 64
  learning_rate_scheduler:
    warmup_fraction: 0.06

You can also train on English only and evaluate on all languages; this tests zero-shot cross-lingual transfer:

# config_zeroshot.yaml
# First filter amazon_massive_intent to en-US locale, then train:
input_features:
  - name: utt
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: FacebookAI/xlm-roberta-base
      trainable: true

output_features:
  - name: intent
    type: category

trainer:
  epochs: 15
  learning_rate: 2.0e-5
  batch_size: 32
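
A minimal sketch of that filtering step (assuming the column names from the sample above; the CSV filenames are arbitrary):

from ludwig.datasets import amazon_massive_intent

train_df, val_df, test_df = amazon_massive_intent.load(split=True)

# Keep English only for training and validation; evaluation later uses the
# full multilingual test set.
train_df[train_df["locale"] == "en-US"].to_csv("massive_en_train.csv", index=False)
val_df[val_df["locale"] == "en-US"].to_csv("massive_en_val.csv", index=False)

# Then: ludwig train --config config_zeroshot.yaml --dataset massive_en_train.csv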

Train

ludwig train --config config.yaml --dataset "ludwig://amazon_massive_intent"

Or with the Python API:

import yaml
from ludwig.api import LudwigModel
from ludwig.datasets import amazon_massive_intent

config = yaml.safe_load(open("config.yaml"))
model = LudwigModel(config)

train_df, val_df, test_df = amazon_massive_intent.load(split=True)

results = model.train(
    training_set=train_df,
    validation_set=val_df,
    test_set=test_df,
)
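
The return value bundles training statistics with the run's output directory. A small sketch for inspecting it (assuming the (train_stats, preprocessed_data, output_directory) tuple that model.train returns):

train_stats, _, output_dir = results
print(f"Model artifacts written to: {output_dir}")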

Evaluate

Overall evaluation across all 51 languages:

ludwig evaluate \
  --model_path results/experiment_run/model \
  --dataset "ludwig://amazon_massive_intent" \
  --split test \
  --output_directory eval_results

Evaluate per language by filtering the test set:

import pandas as pd

train_df, val_df, test_df = amazon_massive_intent.load(split=True)

# Overall
eval_stats, predictions, _ = model.evaluate(test_df, collect_predictions=True)
print("Overall accuracy:", eval_stats["intent"]["accuracy"])

# Per language
for locale in test_df["locale"].unique():
    subset = test_df[test_df["locale"] == locale]
    stats, _, _ = model.evaluate(subset)
    print(f"{locale}: {stats['intent']['accuracy']:.3f}")

Predict on New Text

ludwig predict \
  --model_path results/experiment_run/model \
  --dataset my_utterances.csv

Or in Python:

import pandas as pd

new_data = pd.DataFrame({
    "utt": [
        "set an alarm for 8am",          # English
        "stell einen wecker für 8 uhr",   # German
        "设置八点的闹钟",                   # Chinese
        "fija una alarma para las 8",     # Spanish
    ]
})

predictions, _ = model.predict(new_data)
print(predictions[["intent_predictions", "intent_probabilities"]])

Tips

Choosing a multilingual encoder

| Model | Languages | Notes |
|---|---|---|
| google-bert/bert-base-multilingual-cased | 104 | Fast, well-tested baseline |
| FacebookAI/xlm-roberta-base | 100 | Generally stronger, especially for low-resource languages |
| FacebookAI/xlm-roberta-large | 100 | Larger, better quality, slower |
| microsoft/mdeberta-v3-base | 100 | Strong multilingual model based on DeBERTa |
| intfloat/multilingual-e5-base | 100 | Optimised for retrieval/similarity tasks |

Language-balanced training

The MASSIVE dataset contains the same number of examples per language, so the model sees equal training signal for every locale. If your dataset has imbalanced language counts, consider upsampling under-represented languages:

import pandas as pd

# Oversample minority locales to 10K examples each
target_count = 10_000
balanced_dfs = []
for locale, group in train_df.groupby("locale"):
    if len(group) < target_count:
        group = group.sample(target_count, replace=True, random_state=42)
    balanced_dfs.append(group)
balanced_train = pd.concat(balanced_dfs).sample(frac=1, random_state=42)
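
Conversely, if duplicating rows is undesirable, over-represented locales can be capped instead (same pattern, sampling without replacement; reuses target_count from above):

capped_dfs = []
for locale, group in train_df.groupby("locale"):
    if len(group) > target_count:
        group = group.sample(target_count, random_state=42)
    capped_dfs.append(group)
capped_train = pd.concat(capped_dfs).sample(frac=1, random_state=42)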

Using language as an extra feature

If your dataset has a language/locale column, you can provide it as an additional categorical input:

input_features:
  - name: utt
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: FacebookAI/xlm-roberta-base
      trainable: true
  - name: locale
    type: category
output_features:
  - name: intent
    type: category
combiner:
  type: concat
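
With locale as an input feature, it must also be supplied at prediction time. A minimal sketch (assuming a model trained with the config above):

import pandas as pd

new_data = pd.DataFrame({
    "utt": ["set an alarm for 8am"],
    "locale": ["en-US"],  # required once locale is an input feature
})
predictions, _ = model.predict(new_data)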

Named Entity Recognition across languages

For multilingual NER, use the multinerd or wikiann_* datasets with a sequence output feature:

# config_multilingual_ner.yaml
input_features:
  - name: sentence
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: FacebookAI/xlm-roberta-base
      trainable: true

output_features:
  - name: ner_tags
    type: sequence

trainer:
  epochs: 10
  learning_rate: 2.0e-5
  batch_size: 32

Train with:

ludwig train --config config_multilingual_ner.yaml --dataset "ludwig://multinerd"
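
The sequence output feature expects one tag per input token. A hypothetical two-row example of the expected format (assuming space-separated IOB2 tags aligned with the tokens in sentence; the actual multinerd column names may differ):

import pandas as pd

ner_df = pd.DataFrame({
    "sentence": ["John lives in Berlin", "Anna moved to Paris"],
    "ner_tags": ["B-PER O O B-LOC", "B-PER O O B-LOC"],
})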

Cross-lingual transfer experiment

A common research pattern is to train on one high-resource language and evaluate zero-shot on others:

# Train on English only (use a fresh LudwigModel built from
# config_zeroshot.yaml rather than the already-trained model above)
en_train = train_df[train_df["locale"] == "en-US"]
en_val = val_df[val_df["locale"] == "en-US"]

model.train(training_set=en_train, validation_set=en_val)

# Evaluate on every language, collecting per-locale accuracy
accs = {}
for locale in test_df["locale"].unique():
    subset = test_df[test_df["locale"] == locale]
    stats, _, _ = model.evaluate(subset)
    accs[locale] = stats["intent"]["accuracy"]
    print(f"{locale}: {accs[locale]:.3f}")

For reference, the Ludwig dataset names used in this example:

| Dataset | Languages | Task | Ludwig name |
|---|---|---|---|
| Amazon MASSIVE (all languages) | 51 | Intent classification | amazon_massive_intent |
| Amazon MASSIVE (scenario) | 51 | Scenario classification | amazon_massive_scenario |
| Amazon MASSIVE (Science split) | 51 | Intent classification | amazon_science_massive |
| Belebele | 122 | Reading comprehension | belebele |
| MultiNERD | 10 | Fine-grained NER | multinerd |
| WikiANN English | English | NER | wikiann_en |
| WikiANN German | German | NER | wikiann_de |
| CLUE AFQMC | Chinese | Sentence similarity | clue_afqmc |
| CMRC 2018 | Chinese | Reading comprehension | cmrc2018 |
| Alpaca GPT-4 Chinese | Chinese | Instruction tuning | alpaca_gpt4_zh |
| Bornholm Bitext | Danish/Bornholmish | Translation | bornholm_bitext |
| Europarl Bulgarian-Czech | Bulgarian/Czech | Translation | europarl_bg_cs |

See Also