Dataset Zoo

The Ludwig Dataset Zoo provides datasets that can be directly plugged into a Ludwig model.

The simplest way to use a dataset is to reference it as a URI when specifying the training set:

ludwig train --dataset ludwig://reuters ...

Any Ludwig dataset can be specified as a URI of the form ludwig://<dataset>.
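To make the URI form concrete, here is a minimal sketch that builds and parses `ludwig://<dataset>` strings. The helpers `dataset_uri` and `dataset_name_from_uri` are hypothetical names introduced for illustration; they are not part of Ludwig, which resolves these URIs internally.

```python
LUDWIG_SCHEME = "ludwig://"


def dataset_uri(name: str) -> str:
    """Build a URI of the form ludwig://<dataset> (hypothetical helper)."""
    return f"{LUDWIG_SCHEME}{name}"


def dataset_name_from_uri(uri: str) -> str:
    """Extract <dataset> from ludwig://<dataset> (hypothetical helper)."""
    if not uri.startswith(LUDWIG_SCHEME):
        raise ValueError(f"not a Ludwig dataset URI: {uri!r}")
    name = uri[len(LUDWIG_SCHEME):]
    if not name:
        raise ValueError("dataset name is empty")
    return name


print(dataset_uri("reuters"))
print(dataset_name_from_uri("ludwig://reuters"))
```

A helper like this can be handy for validating user input before passing a URI to `ludwig train --dataset`.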

Datasets can also be programmatically imported and loaded into a Pandas DataFrame using the .load() method:

from ludwig.datasets import reuters

# Loads into single dataframe with a 'split' column:
dataset_df = reuters.load()

# Loads into split dataframes:
train_df, test_df, _ = reuters.load(split=True)

The ludwig.datasets API also provides functions to list, describe, and get datasets. For example:

import ludwig.datasets

# Gets a list of all available dataset names.
dataset_names = ludwig.datasets.list_datasets()

# Prints the description of the titanic dataset.
print(ludwig.datasets.describe_dataset("titanic"))

# Gets the titanic dataset object by name.
titanic = ludwig.datasets.get_dataset("titanic")

# Loads into single dataframe with a 'split' column:
dataset_df = titanic.load()

# Loads into split dataframes:
train_df, test_df, _ = titanic.load(split=True)
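Because the zoo contains many related datasets (for example, one per language), it can help to filter the list of names by pattern. The sketch below assumes the names are plain strings, as returned by `list_datasets()` above; `filter_dataset_names` is a hypothetical helper, and the sample list stands in for the real output.

```python
from fnmatch import fnmatchcase


def filter_dataset_names(names, pattern):
    """Return the sorted dataset names matching a shell-style pattern."""
    return sorted(name for name in names if fnmatchcase(name, pattern))


# In practice: names = ludwig.datasets.list_datasets()
names = ["titanic", "mteb_massive_intent_en", "mteb_massive_intent_fr", "mnist"]
print(filter_dataset_names(names, "mteb_massive_intent_*"))
```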

Kaggle Datasets

Some datasets are hosted on Kaggle and require a Kaggle account. To use these, you'll need to set up Kaggle credentials in your environment. If the dataset is part of a Kaggle competition, you'll also need to accept the terms on the competition page.

To check programmatically whether a dataset is hosted on Kaggle, use its .is_kaggle_dataset property.
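Before loading a Kaggle-hosted dataset, it may be useful to check that credentials are present. The sketch below assumes the conventions of the official Kaggle API client: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file in `~/.kaggle`. The function `has_kaggle_credentials` is a hypothetical helper, not a Ludwig API, and it only checks that credentials exist, not that they are valid.

```python
import os
from pathlib import Path


def has_kaggle_credentials(env=None, config_dir=None) -> bool:
    """Return True if Kaggle credentials appear to be configured.

    Assumes the Kaggle API client conventions: environment variables
    KAGGLE_USERNAME and KAGGLE_KEY, or a kaggle.json file in ~/.kaggle.
    """
    env = os.environ if env is None else env
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    config_dir = Path(config_dir) if config_dir is not None else Path.home() / ".kaggle"
    return (config_dir / "kaggle.json").is_file()


if not has_kaggle_credentials():
    print("Kaggle credentials not found; Kaggle-hosted datasets may fail to download.")
```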

Downloading, Processing, and Exporting

Datasets are first downloaded into LUDWIG_CACHE, which may be set as an environment variable and defaults to $HOME/.ludwig_cache.

Datasets are automatically loaded, processed, and re-saved as parquet files in the cache.
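The cache location rule described above can be sketched as follows. This mirrors the documented behavior (the `LUDWIG_CACHE` environment variable, with a default of `$HOME/.ludwig_cache`); the function `ludwig_cache_dir` is a hypothetical helper and does not call into Ludwig itself.

```python
import os
from pathlib import Path


def ludwig_cache_dir(env=None) -> Path:
    """Resolve the dataset cache directory as described in the text.

    Uses LUDWIG_CACHE when it is set and non-empty, otherwise falls
    back to $HOME/.ludwig_cache.
    """
    env = os.environ if env is None else env
    custom = env.get("LUDWIG_CACHE")
    return Path(custom) if custom else Path.home() / ".ludwig_cache"


print(ludwig_cache_dir())
```

Setting `LUDWIG_CACHE` before the first download is a simple way to keep large datasets on a separate disk.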

To export the processed dataset, including any files it depends on, use the .export(output_directory) method. This is recommended if the dataset contains media files like images or audio files. File paths are relative to the working directory of the training process.

from ludwig.datasets import twitter_bots

# Exports twitter bots dataset and image files to the current working directory.
twitter_bots.export(".")
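Since exported file paths are relative to the working directory of the training process, one option is to export into a dedicated directory and run training from there. The sketch below shows a small context manager for temporarily switching directories; `working_directory` is a hypothetical helper, and the commented usage assumes the `.export()` method shown above.

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily change the current working directory to `path`."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Always restore the original directory, even if training fails.
        os.chdir(previous)


# Hypothetical usage with an exported dataset:
#
#   twitter_bots.export("exported_data")
#   with working_directory("exported_data"):
#       ...  # run training here so relative media paths resolve
```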

End-to-end Example

Here's an end-to-end example of training a model using the MNIST dataset:

from ludwig.api import LudwigModel
from ludwig.datasets import mnist

# Initializes a Ludwig model
config = {
    "input_features": [{"name": "image_path", "type": "image"}],
    "output_features": [{"name": "label", "type": "category"}],
}
model = LudwigModel(config)

# Loads and splits MNIST dataset
training_set, test_set, _ = mnist.load(split=True)

# Exports the mnist image files to the current working directory.
mnist.export(".")

# Runs model training
results = model.train(training_set=training_set, test_set=test_set, model_name="mnist_model")
train_stats = results.train_stats

Dataset Splits

All datasets in the dataset zoo are provided with a default train/validation/test split. When loading with split=False, the default split will be returned (and is guaranteed to be the same every time). With split=True, Ludwig will randomly re-split the dataset.

Note

Some benchmark or contest datasets are released with held-out test set labels. In other words, the train and validation splits have labels, but the test set does not. Most Kaggle contest datasets have this unlabeled test set.

Splits:

train: Data to train on. Required, must have labels.
validation: Subset of dataset to evaluate while training. Optional, must have labels.
test: Held out from model development, used for later testing. Optional, may not be labeled.
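When a dataset is loaded into a single dataframe, the 'split' column records which split each row belongs to. The sketch below counts rows per split, assuming the integer encoding 0 = train, 1 = validation, 2 = test; that encoding is an assumption here and should be checked against your Ludwig version. `split_counts` is a hypothetical helper that accepts any iterable of split codes, such as `dataset_df["split"]`.

```python
from collections import Counter

# Assumed encoding of the 'split' column; verify for your Ludwig version.
SPLIT_NAMES = {0: "train", 1: "validation", 2: "test"}


def split_counts(split_values):
    """Count rows per named split from an iterable of integer split codes."""
    counts = Counter(int(value) for value in split_values)
    return {name: counts.get(code, 0) for code, name in SPLIT_NAMES.items()}


# In practice: split_counts(dataset_df["split"])
print(split_counts([0, 0, 0, 1, 2, 2]))
```

A validation or test count of zero indicates that the dataset does not provide that optional split.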

Zoo Datasets

Ludwig ships with 590 datasets spanning 15 task categories. Each dataset can be loaded with ludwig datasets download <name> or from ludwig.datasets import <name>.
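To script the download command above for several datasets, one possible approach is to invoke the CLI through `subprocess`. The sketch below only relies on the `ludwig datasets download <name>` form given in the text; `download_command` and `download_dataset` are hypothetical helpers, and running them requires the `ludwig` executable to be on your PATH.

```python
import subprocess


def download_command(name: str) -> list:
    """Build the argument list for `ludwig datasets download <name>`."""
    if not name.isidentifier():
        # Zoo names in the tables below are identifier-like (letters, digits, underscores).
        raise ValueError(f"unexpected dataset name: {name!r}")
    return ["ludwig", "datasets", "download", name]


def download_dataset(name: str) -> None:
    """Run the download command, raising CalledProcessError on failure."""
    subprocess.run(download_command(name), check=True)


print(" ".join(download_command("mnist")))
```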

Text Classification (286 datasets)

Dataset	Source	Description
`aegis_safety`	HuggingFace	NVIDIA Aegis 2.0: AI content safety classification. 30K train examples.
`ag_news_hf`	HuggingFace	AG News 4-class topic classification (HF version)
`agnews`	Download	News articles categorized as "World", "Sports", "Business", and "Science".
`amazon_massive_intent`	HuggingFace	Amazon MASSIVE multilingual intent classification (60 intents, all 51 languages combined). 587K trai
`amazon_massive_scenario`	HuggingFace	Amazon MASSIVE multilingual scenario classification (18 scenarios, 51 languages). 587K train.
`amazon_polarity`	HuggingFace	Amazon product review polarity; positive/negative
`amazon_reviews`	Download	The Amazon Reviews dataset
`amazon_science_massive`	HuggingFace	Amazon MASSIVE multilingual intent classification dataset (51 languages, 60 intents).
`anli`	HuggingFace	Adversarial NLI; iteratively adversarial NLI benchmark
`aqua_rat`	HuggingFace	AQuA-RAT: algebraic word problems 5-way MC with rationales. 97K train examples.
`arc_challenge`	HuggingFace	AI2 ARC-Challenge harder science QA. 4-way multiple choice.
`arc_easy`	HuggingFace	AI2 ARC-Easy science QA. 4-way multiple choice. A/B/C/D.
`banking77`	HuggingFace	Banking77; 77-class banking customer intent classification
`banking77_legacy`	HuggingFace	Banking77 (legacy) 77-class banking intent classification. 10K train examples.
`belebele`	HuggingFace	Belebele multilingual reading comprehension; multiple-choice
`belebele_fr`	HuggingFace	Belebele French; multilingual reading comprehension
`bitext_customer_intent`	HuggingFace	Bitext customer support intent detection (26K train, 27 intents)
`bitext_customer_support`	HuggingFace	Bitext customer support: 27 intents, 10 categories. 26K train examples.
`boolq`	HuggingFace	Boolean Questions; reading comprehension yes/no questions from Google
`brazilian_toxic_tweets`	HuggingFace	Brazilian toxic tweets binary classification. 8K train examples.
`ccnews`	HuggingFace	CC-News: multilingual web news articles from Common Crawl. 1.9M train examples.
`climate_fever`	HuggingFace	ClimateFEVER: climate change claim verification dataset.
`climate_sentiment`	HuggingFace	ClimateBERT climate-related text sentiment analysis
`clinc_oos`	HuggingFace	CLINC OOS: 150-class intent classification with out-of-scope detection. 10K train examples.
`clue_afqmc`	HuggingFace	CLUE AFQMC: Chinese short text similarity for financial domain. 34K train examples.
`code_contests`	HuggingFace	DeepMind code contests: competitive programming problems with difficulty classification. 3.7K train
`coig_cqia`	HuggingFace	COIG-CQIA: Chinese instruction dataset with domain/task type. 1111 examples.
`cola`	HuggingFace	Corpus of Linguistic Acceptability; grammatically acceptable or not
`commitment_bank`	HuggingFace	CommitmentBank; textual entailment with 3-way classification
`commonsense_qa`	HuggingFace	CommonsenseQA 5-way multiple choice. Question → A/B/C/D/E.
`copa`	HuggingFace	Choice Of Plausible Alternatives; causal commonsense reasoning
`customer_reviews`	HuggingFace	Customer Reviews; product review sentiment binary classification
`dair_emotion`	HuggingFace	DAIR.AI Emotion: 6-class emotion classification from English Twitter messages.
`data_scientist_salary`	Download	The training data and test data comprise of 19802 samples and of 6601 samples each from the
`databench_qa`	HuggingFace	DataBench: data analysis question-answer type classification. 1830 train examples.
`databricks_dolly_15k`	HuggingFace	Databricks Dolly 15K instruction-response pairs with 8 task categories (QA, summarization, etc.).
`dbpedia`	Download	The DBPedia Ontology dataset.
`dbpedia_14`	HuggingFace	DBpedia 14 ontology text classification; 14 categories
`dolly_15k`	HuggingFace	Databricks Dolly 15K: 15K instruction-response pairs with task category labels.
`emotion`	HuggingFace	Twitter emotion classification; 6 classes: joy, sadness, anger, fear, surprise, love
`enron_spam`	HuggingFace	Enron Spam; email spam/ham classification
`fake_news_detection`	HuggingFace	Fake news detection; real vs fake news articles
`farstail_nli`	HuggingFace	FarsTail: Persian NLI (entailment/neutral/contradiction). 1K test examples.
`fever`	Download	FEVER: a Large-scale Dataset for Fact Extraction and VERification
`fever_gold`	HuggingFace	FEVER fact verification gold evidence. Claim → SUPPORTS/REFUTES/NOT_ENOUGH_INFO.
`financial_phrasebank`	HuggingFace	Financial classification: sentiment analysis of financial text.
`flores_101`	HuggingFace	FLoRes-101 English evaluation sentences. Topic classification: predict
`goemotions`	Download	GoEmotions: A Dataset for Fine-Grained Emotion Classification.
`google_quest_qa`	Download	Google QUEST Q&A Labeling
`hate_speech18`	HuggingFace	Hate Speech 18 (SetFit): binary hate speech detection on online forum data.
`hatespeech_offensive`	HuggingFace	Hate Speech and Offensive Language: 25K tweets with 3-class labels
`hellaswag`	HuggingFace	HellaSwag commonsense NLI. Activity + context → correct ending. A/B/C/D.
`imdb_sentiment`	HuggingFace	IMDB movie review sentiment; positive/negative binary classification
`indic_glue`	HuggingFace	IndicGLUE Telugu sentiment classification (actsa-sc). 4328 train examples.
`klue_topic`	HuggingFace	KLUE YNAT; Korean news topic classification; 7 categories
`kmmlu`	HuggingFace	KMMLU: Korean Massive Multitask Language Understanding, 4-way MC. 45 train examples per subset.
`m3cot`	HuggingFace	M3CoT: multimodal multi-domain QA category classification. 7K examples.
`m_mmlu`	HuggingFace	Multilingual MMLU (Arabic): 4-way multiple choice academic questions. 274 train examples.
`medmcqa`	HuggingFace	Medical entrance exam QA; 4-choice medical questions
`melbourne_airbnb`	Download	Melbourne Airbnb Open Data
`mmlu`	HuggingFace	MMLU massive multitask benchmark. 57 tasks, 4-way multiple choice.
`mmlu_lighteval`	HuggingFace	MMLU: Massive Multitask Language Understanding, 57 academic subjects, 4-way MC. 99K auxiliary train.
`mmlu_pro`	HuggingFace	MMLU-Pro harder multitask benchmark. 10-way multiple choice.
`mnli`	HuggingFace	Multi-Genre Natural Language Inference; premise + hypothesis -> entailment/neutral/contradiction
`mrpc`	HuggingFace	Microsoft Research Paraphrase Corpus; paraphrase detection
`mteb_amazon_reviews_class_de`	HuggingFace	MTEB Amazon Reviews Classification (German): 5-class star rating classification.
`mteb_amazon_reviews_class_en`	HuggingFace	MTEB Amazon Reviews Classification (English): 5-class star rating classification.
`mteb_amazon_reviews_class_es`	HuggingFace	MTEB Amazon Reviews Classification (Spanish): 5-class star rating classification.
`mteb_amazon_reviews_class_fr`	HuggingFace	MTEB Amazon Reviews Classification (French): 5-class star rating classification.
`mteb_amazon_reviews_class_ja`	HuggingFace	MTEB Amazon Reviews Classification (Japanese): 5-class star rating classification.
`mteb_amazon_reviews_class_zh`	HuggingFace	MTEB Amazon Reviews Classification (Chinese): 5-class star rating classification.
`mteb_cyrillic_turkic`	HuggingFace	MTEB Cyrillic Turkic Language Classification: language identification for Cyrillic-script Turkic lan
`mteb_emotion`	HuggingFace	MTEB emotion task; 6-class emotion from tweets
`mteb_financial_phrasebank`	HuggingFace	MTEB Financial Phrasebank Classification: financial news sentiment classification.
`mteb_frenk_en`	HuggingFace	MTEB Frenk English Classification: hate speech detection in English.
`mteb_frenk_hr`	HuggingFace	MTEB Frenk Croatian Classification: hate speech detection in Croatian.
`mteb_frenk_sl`	HuggingFace	MTEB Frenk Slovenian Classification: hate speech detection in Slovenian.
`mteb_georeview`	HuggingFace	MTEB Georeview Classification: Russian-language geographic review sentiment classification.
`mteb_greek_legal`	HuggingFace	MTEB Greek Legal Code Classification: Greek legal code topic classification.
`mteb_ita_casehold`	HuggingFace	MTEB ItaCasehold Classification: Italian legal case holding classification.
`mteb_jd_review`	HuggingFace	MTEB JDReview: Chinese product review sentiment classification from JD.com.
`mteb_kor_sarcasm`	HuggingFace	MTEB Korean Sarcasm Classification: sarcasm detection in Korean social media.
`mteb_language_class`	HuggingFace	MTEB Language Classification: language identification from short text.
`mteb_massive_intent_af`	HuggingFace	MTEB MASSIVE Intent Classification (Afrikaans): task-oriented dialog intent prediction.
`mteb_massive_intent_am`	HuggingFace	MTEB MASSIVE Intent Classification (Amharic): task-oriented dialog intent prediction.
`mteb_massive_intent_ar`	HuggingFace	MTEB MASSIVE Intent (Arabic): 60-class intent classification from voice assistant queries.
`mteb_massive_intent_az`	HuggingFace	MTEB MASSIVE Intent Classification (Azerbaijani): task-oriented dialog intent prediction.
`mteb_massive_intent_bn`	HuggingFace	MTEB MASSIVE Intent Classification (Bengali): task-oriented dialog intent prediction.
`mteb_massive_intent_cy`	HuggingFace	MTEB MASSIVE Intent Classification (Welsh): task-oriented dialog intent prediction.
`mteb_massive_intent_da`	HuggingFace	MTEB MASSIVE Intent Classification (Danish): task-oriented dialog intent prediction.
`mteb_massive_intent_de`	HuggingFace	MTEB MASSIVE Intent (German): 60-class intent classification from voice assistant queries.
`mteb_massive_intent_el`	HuggingFace	MTEB MASSIVE Intent Classification (Greek): task-oriented dialog intent prediction.
`mteb_massive_intent_en`	HuggingFace	MTEB MASSIVE Intent (English): 60-class intent classification from voice assistant queries.
`mteb_massive_intent_es`	HuggingFace	MTEB MASSIVE Intent (Spanish): 60-class intent classification from voice assistant queries.
`mteb_massive_intent_fa`	HuggingFace	MTEB MASSIVE Intent Classification (Farsi): task-oriented dialog intent prediction.
`mteb_massive_intent_fi`	HuggingFace	MTEB MASSIVE Intent Classification (Finnish): task-oriented dialog intent prediction.
`mteb_massive_intent_fr`	HuggingFace	MTEB MASSIVE Intent (French): 60-class intent classification from voice assistant queries.
`mteb_massive_intent_he`	HuggingFace	MTEB MASSIVE Intent Classification (Hebrew): task-oriented dialog intent prediction.
`mteb_massive_intent_hi`	HuggingFace	MTEB MASSIVE Intent Classification (Hindi): task-oriented dialog intent prediction.
`mteb_massive_intent_hu`	HuggingFace	MTEB MASSIVE Intent Classification (Hungarian): task-oriented dialog intent prediction.
`mteb_massive_intent_hy`	HuggingFace	MTEB MASSIVE Intent Classification (Armenian): task-oriented dialog intent prediction.
`mteb_massive_intent_id`	HuggingFace	MTEB MASSIVE Intent Classification (Indonesian): task-oriented dialog intent prediction.
`mteb_massive_intent_is`	HuggingFace	MTEB MASSIVE Intent Classification (Icelandic): task-oriented dialog intent prediction.
`mteb_massive_intent_it`	HuggingFace	MTEB MASSIVE Intent Classification (Italian): task-oriented dialog intent prediction.
`mteb_massive_intent_ja`	HuggingFace	MTEB MASSIVE Intent Classification (Japanese): task-oriented dialog intent prediction.
`mteb_massive_intent_jv`	HuggingFace	MTEB MASSIVE Intent Classification (Javanese): task-oriented dialog intent prediction.
`mteb_massive_intent_ka`	HuggingFace	MTEB MASSIVE Intent Classification (Georgian): task-oriented dialog intent prediction.
`mteb_massive_intent_km`	HuggingFace	MTEB MASSIVE Intent Classification (Khmer): task-oriented dialog intent prediction.
`mteb_massive_intent_kn`	HuggingFace	MTEB MASSIVE Intent Classification (Kannada): task-oriented dialog intent prediction.
`mteb_massive_intent_ko`	HuggingFace	MTEB MASSIVE Intent Classification (Korean): task-oriented dialog intent prediction.
`mteb_massive_intent_lv`	HuggingFace	MTEB MASSIVE Intent Classification (Latvian): task-oriented dialog intent prediction.
`mteb_massive_intent_ml`	HuggingFace	MTEB MASSIVE Intent Classification (Malayalam): task-oriented dialog intent prediction.
`mteb_massive_intent_mn`	HuggingFace	MTEB MASSIVE Intent Classification (Mongolian): task-oriented dialog intent prediction.
`mteb_massive_intent_ms`	HuggingFace	MTEB MASSIVE Intent Classification (Malay): task-oriented dialog intent prediction.
`mteb_massive_intent_my`	HuggingFace	MTEB MASSIVE Intent Classification (Burmese): task-oriented dialog intent prediction.
`mteb_massive_intent_nb`	HuggingFace	MTEB MASSIVE Intent Classification (Norwegian): task-oriented dialog intent prediction.
`mteb_massive_intent_nl`	HuggingFace	MTEB MASSIVE Intent Classification (Dutch): task-oriented dialog intent prediction.
`mteb_massive_intent_pl`	HuggingFace	MTEB MASSIVE Intent Classification (Polish): task-oriented dialog intent prediction.
`mteb_massive_intent_pt`	HuggingFace	MTEB MASSIVE Intent Classification (Portuguese): task-oriented dialog intent prediction.
`mteb_massive_intent_ro`	HuggingFace	MTEB MASSIVE Intent Classification (Romanian): task-oriented dialog intent prediction.
`mteb_massive_intent_ru`	HuggingFace	MTEB MASSIVE Intent Classification (Russian): task-oriented dialog intent prediction.
`mteb_massive_intent_sl`	HuggingFace	MTEB MASSIVE Intent Classification (Slovenian): task-oriented dialog intent prediction.
`mteb_massive_intent_sq`	HuggingFace	MTEB MASSIVE Intent Classification (Albanian): task-oriented dialog intent prediction.
`mteb_massive_intent_sv`	HuggingFace	MTEB MASSIVE Intent Classification (Swedish): task-oriented dialog intent prediction.
`mteb_massive_intent_sw`	HuggingFace	MTEB MASSIVE Intent Classification (Swahili): task-oriented dialog intent prediction.
`mteb_massive_intent_ta`	HuggingFace	MTEB MASSIVE Intent Classification (Tamil): task-oriented dialog intent prediction.
`mteb_massive_intent_te`	HuggingFace	MTEB MASSIVE Intent Classification (Telugu): task-oriented dialog intent prediction.
`mteb_massive_intent_th`	HuggingFace	MTEB MASSIVE Intent Classification (Thai): task-oriented dialog intent prediction.
`mteb_massive_intent_tl`	HuggingFace	MTEB MASSIVE Intent Classification (Tagalog): task-oriented dialog intent prediction.
`mteb_massive_intent_tr`	HuggingFace	MTEB MASSIVE Intent Classification (Turkish): task-oriented dialog intent prediction.
`mteb_massive_intent_ur`	HuggingFace	MTEB MASSIVE Intent Classification (Urdu): task-oriented dialog intent prediction.
`mteb_massive_intent_vi`	HuggingFace	MTEB MASSIVE Intent Classification (Vietnamese): task-oriented dialog intent prediction.
`mteb_massive_intent_zh_cn`	HuggingFace	MTEB MASSIVE Intent Classification (Chinese (Simplified)): task-oriented dialog intent prediction.
`mteb_massive_intent_zh_tw`	HuggingFace	MTEB MASSIVE Intent Classification (Chinese (Traditional)): task-oriented dialog intent prediction.
`mteb_massive_scenario_af`	HuggingFace	MTEB MASSIVE Scenario Classification (Afrikaans): task-oriented dialog scenario prediction.
`mteb_massive_scenario_am`	HuggingFace	MTEB MASSIVE Scenario Classification (Amharic): task-oriented dialog scenario prediction.
`mteb_massive_scenario_ar`	HuggingFace	MTEB MASSIVE Scenario Classification (Arabic): task-oriented dialog scenario prediction.
`mteb_massive_scenario_az`	HuggingFace	MTEB MASSIVE Scenario Classification (Azerbaijani): task-oriented dialog scenario prediction.
`mteb_massive_scenario_bn`	HuggingFace	MTEB MASSIVE Scenario Classification (Bengali): task-oriented dialog scenario prediction.
`mteb_massive_scenario_cy`	HuggingFace	MTEB MASSIVE Scenario Classification (Welsh): task-oriented dialog scenario prediction.
`mteb_massive_scenario_da`	HuggingFace	MTEB MASSIVE Scenario Classification (Danish): task-oriented dialog scenario prediction.
`mteb_massive_scenario_de`	HuggingFace	MTEB MASSIVE Scenario (German): 18-class scenario classification from voice assistant queries.
`mteb_massive_scenario_el`	HuggingFace	MTEB MASSIVE Scenario Classification (Greek): task-oriented dialog scenario prediction.
`mteb_massive_scenario_en`	HuggingFace	MTEB MASSIVE Scenario (English): 18-class scenario classification from voice assistant queries.
`mteb_massive_scenario_es`	HuggingFace	MTEB MASSIVE Scenario (Spanish): 18-class scenario classification from voice assistant queries.
`mteb_massive_scenario_fa`	HuggingFace	MTEB MASSIVE Scenario Classification (Farsi): task-oriented dialog scenario prediction.
`mteb_massive_scenario_fi`	HuggingFace	MTEB MASSIVE Scenario Classification (Finnish): task-oriented dialog scenario prediction.
`mteb_massive_scenario_fr`	HuggingFace	MTEB MASSIVE Scenario (French): 18-class scenario classification from voice assistant queries.
`mteb_massive_scenario_he`	HuggingFace	MTEB MASSIVE Scenario Classification (Hebrew): task-oriented dialog scenario prediction.
`mteb_massive_scenario_hi`	HuggingFace	MTEB MASSIVE Scenario Classification (Hindi): task-oriented dialog scenario prediction.
`mteb_massive_scenario_hu`	HuggingFace	MTEB MASSIVE Scenario Classification (Hungarian): task-oriented dialog scenario prediction.
`mteb_massive_scenario_hy`	HuggingFace	MTEB MASSIVE Scenario Classification (Armenian): task-oriented dialog scenario prediction.
`mteb_massive_scenario_id`	HuggingFace	MTEB MASSIVE Scenario Classification (Indonesian): task-oriented dialog scenario prediction.
`mteb_massive_scenario_is`	HuggingFace	MTEB MASSIVE Scenario Classification (Icelandic): task-oriented dialog scenario prediction.
`mteb_massive_scenario_it`	HuggingFace	MTEB MASSIVE Scenario Classification (Italian): task-oriented dialog scenario prediction.
`mteb_massive_scenario_ja`	HuggingFace	MTEB MASSIVE Scenario Classification (Japanese): task-oriented dialog scenario prediction.
`mteb_massive_scenario_jv`	HuggingFace	MTEB MASSIVE Scenario Classification (Javanese): task-oriented dialog scenario prediction.
`mteb_massive_scenario_ka`	HuggingFace	MTEB MASSIVE Scenario Classification (Georgian): task-oriented dialog scenario prediction.
`mteb_massive_scenario_km`	HuggingFace	MTEB MASSIVE Scenario Classification (Khmer): task-oriented dialog scenario prediction.
`mteb_massive_scenario_kn`	HuggingFace	MTEB MASSIVE Scenario Classification (Kannada): task-oriented dialog scenario prediction.
`mteb_massive_scenario_ko`	HuggingFace	MTEB MASSIVE Scenario Classification (Korean): task-oriented dialog scenario prediction.
`mteb_massive_scenario_lv`	HuggingFace	MTEB MASSIVE Scenario Classification (Latvian): task-oriented dialog scenario prediction.
`mteb_massive_scenario_ml`	HuggingFace	MTEB MASSIVE Scenario Classification (Malayalam): task-oriented dialog scenario prediction.
`mteb_massive_scenario_mn`	HuggingFace	MTEB MASSIVE Scenario Classification (Mongolian): task-oriented dialog scenario prediction.
`mteb_massive_scenario_ms`	HuggingFace	MTEB MASSIVE Scenario Classification (Malay): task-oriented dialog scenario prediction.
`mteb_massive_scenario_my`	HuggingFace	MTEB MASSIVE Scenario Classification (Burmese): task-oriented dialog scenario prediction.
`mteb_massive_scenario_nb`	HuggingFace	MTEB MASSIVE Scenario Classification (Norwegian): task-oriented dialog scenario prediction.
`mteb_massive_scenario_nl`	HuggingFace	MTEB MASSIVE Scenario Classification (Dutch): task-oriented dialog scenario prediction.
`mteb_massive_scenario_pl`	HuggingFace	MTEB MASSIVE Scenario Classification (Polish): task-oriented dialog scenario prediction.
`mteb_massive_scenario_pt`	HuggingFace	MTEB MASSIVE Scenario Classification (Portuguese): task-oriented dialog scenario prediction.
`mteb_massive_scenario_ro`	HuggingFace	MTEB MASSIVE Scenario Classification (Romanian): task-oriented dialog scenario prediction.
`mteb_massive_scenario_ru`	HuggingFace	MTEB MASSIVE Scenario Classification (Russian): task-oriented dialog scenario prediction.
`mteb_massive_scenario_sl`	HuggingFace	MTEB MASSIVE Scenario Classification (Slovenian): task-oriented dialog scenario prediction.
`mteb_massive_scenario_sq`	HuggingFace	MTEB MASSIVE Scenario Classification (Albanian): task-oriented dialog scenario prediction.
`mteb_massive_scenario_sv`	HuggingFace	MTEB MASSIVE Scenario Classification (Swedish): task-oriented dialog scenario prediction.
`mteb_massive_scenario_sw`	HuggingFace	MTEB MASSIVE Scenario Classification (Swahili): task-oriented dialog scenario prediction.
`mteb_massive_scenario_ta`	HuggingFace	MTEB MASSIVE Scenario Classification (Tamil): task-oriented dialog scenario prediction.
`mteb_massive_scenario_te`	HuggingFace	MTEB MASSIVE Scenario Classification (Telugu): task-oriented dialog scenario prediction.
`mteb_massive_scenario_th`	HuggingFace	MTEB MASSIVE Scenario Classification (Thai): task-oriented dialog scenario prediction.
`mteb_massive_scenario_tl`	HuggingFace	MTEB MASSIVE Scenario Classification (Tagalog): task-oriented dialog scenario prediction.
`mteb_massive_scenario_tr`	HuggingFace	MTEB MASSIVE Scenario Classification (Turkish): task-oriented dialog scenario prediction.
`mteb_massive_scenario_ur`	HuggingFace	MTEB MASSIVE Scenario Classification (Urdu): task-oriented dialog scenario prediction.
`mteb_massive_scenario_vi`	HuggingFace	MTEB MASSIVE Scenario Classification (Vietnamese): task-oriented dialog scenario prediction.
`mteb_massive_scenario_zh_cn`	HuggingFace	MTEB MASSIVE Scenario (Chinese Simplified): 18-class scenario classification.
`mteb_massive_scenario_zh_tw`	HuggingFace	MTEB MASSIVE Scenario Classification (Chinese (Traditional)): task-oriented dialog scenario predicti
`mteb_mtop_domain_de`	HuggingFace	MTEB MTOP Domain Classification (German): task-oriented dialog domain prediction.
`mteb_mtop_domain_en`	HuggingFace	MTEB MTOP Domain Classification (English): task-oriented dialog domain prediction.
`mteb_mtop_domain_es`	HuggingFace	MTEB MTOP Domain Classification (Spanish): task-oriented dialog domain prediction.
`mteb_mtop_domain_fr`	HuggingFace	MTEB MTOP Domain Classification (French): task-oriented dialog domain prediction.
`mteb_mtop_domain_hi`	HuggingFace	MTEB MTOP Domain Classification (Hindi): task-oriented dialog domain prediction.
`mteb_mtop_domain_th`	HuggingFace	MTEB MTOP Domain Classification (Thai): task-oriented dialog domain prediction.
`mteb_mtop_intent_de2`	HuggingFace	MTEB MTOP Intent Classification (German): task-oriented dialog intent prediction.
`mteb_mtop_intent_en`	HuggingFace	MTEB MTOP Intent Classification (English): task-oriented dialog intent prediction.
`mteb_mtop_intent_es2`	HuggingFace	MTEB MTOP Intent Classification (Spanish): task-oriented dialog intent prediction.
`mteb_mtop_intent_fr2`	HuggingFace	MTEB MTOP Intent Classification (French): task-oriented dialog intent prediction.
`mteb_mtop_intent_hi2`	HuggingFace	MTEB MTOP Intent Classification (Hindi): task-oriented dialog intent prediction.
`mteb_mtop_intent_th2`	HuggingFace	MTEB MTOP Intent Classification (Thai): task-oriented dialog intent prediction.
`mteb_multilingual_sentiment`	HuggingFace	MTEB Multilingual Sentiment Classification: multilingual product review sentiment.
`mteb_naija_senti_hau`	HuggingFace	MTEB NaijaSenti (Hausa): Nigerian language sentiment classification.
`mteb_naija_senti_ibo`	HuggingFace	MTEB NaijaSenti (Igbo): Nigerian language sentiment classification.
`mteb_naija_senti_pcm`	HuggingFace	MTEB NaijaSenti (Nigerian Pidgin): Nigerian language sentiment classification.
`mteb_naija_senti_yor`	HuggingFace	MTEB NaijaSenti (Yoruba): Nigerian language sentiment classification.
`mteb_nepali_news`	HuggingFace	MTEB Nepali News Classification: news category classification in Nepali.
`mteb_nordic_lang`	HuggingFace	MTEB Nordic Language Classification: language identification for Nordic languages.
`mteb_online_shopping`	HuggingFace	MTEB OnlineShopping: Chinese online shopping review sentiment classification.
`mteb_poem_sentiment`	HuggingFace	MTEB Poem Sentiment Classification: sentiment classification of poem verses.
`mteb_sensitive_topics`	HuggingFace	MTEB Sensitive Topics Classification: sensitive topic detection in text.
`mteb_sentiment_hindi`	HuggingFace	MTEB Sentiment Analysis Hindi: sentiment classification of Hindi text.
`mteb_swiss_judgement_de`	HuggingFace	MTEB Swiss Judgement Classification (German): Swiss Federal Supreme Court judgement prediction.
`mteb_swiss_judgement_fr`	HuggingFace	MTEB Swiss Judgement Classification (French): Swiss Federal Supreme Court judgement prediction.
`mteb_swiss_judgement_it`	HuggingFace	MTEB Swiss Judgement Classification (Italian): Swiss Federal Supreme Court judgement prediction.
`mteb_tnews`	HuggingFace	MTEB TNews: Chinese news topic classification dataset.
`mteb_turkish_product`	HuggingFace	MTEB Turkish Product Sentiment Classification: Turkish product review sentiment.
`mteb_tweet_sentiment`	HuggingFace	MTEB Tweet Sentiment Extraction: 3-class tweet sentiment classification.
`mteb_tweet_topic`	HuggingFace	MTEB Tweet Topic Single Classification: single-label topic classification of tweets.
`mteb_waimai`	HuggingFace	MTEB Waimai: Chinese food delivery review sentiment classification.
`mteb_yahoo_answers`	HuggingFace	MTEB Yahoo Answers Topics Classification: topic classification of Yahoo Answers questions.
`multi_nli`	HuggingFace	Multi-Genre NLI; 10 diverse genres, 3-way NLI
`nemotron_pii`	HuggingFace	Nemotron PII: document classification by domain type. 100K examples.
`nemotron_safety`	HuggingFace	Nemotron Safety Guard v3: prompt/response safety classification. 451K examples.
`news_category`	HuggingFace	News Category Dataset: 210K Huffington Post headlines with 42 categories.
`news_channel`	Download	Online News Popularity Data Set
`nli_zh_all`	HuggingFace	NLI-ZH: Chinese Natural Language Inference dataset merging multiple Chinese
`no_robots`	HuggingFace	No Robots: 10K high-quality instruction-following conversations with category labels.
`numinamath`	HuggingFace	NuminaMath: competition math problems with verified answers. 131K train examples.
`oasst1`	HuggingFace	OpenAssistant OASST1: 161K messages from 35K conversation trees; role classification.
`ohsumed_7400`	Kaggle	Ohsumed corpus is extracted from MEDLINE database. MEDLINE is designed for multi-label classificatio
`ohsumed_cmu`	Download	OHSUMED is a well-known medical abstracts dataset. It contains 348,566 references,
`openbookqa`	HuggingFace	OpenBookQA elementary science 4-way multiple choice.
`or_bench`	HuggingFace	OR-Bench: over-refusal benchmark prompt classification. 80K examples.
`paws`	HuggingFace	Paraphrase Adversaries from Word Scrambling; challenging paraphrase detection
`paws_x`	HuggingFace	PAWS-X multilingual paraphrase identification. 49K train examples.
`poem_sentiment`	HuggingFace	Poem Sentiment: verse-level sentiment classification from English poetry.
`poem_sentiment_hf`	HuggingFace	Poem Sentiment (Google Research): verse-level sentiment from English poetry.
`product_sentiment_machine_hack`	Download	We challenge the machinehackers community to develop a machine learning model
`pubmed_qa`	HuggingFace	PubMedQA biomedical QA. Context + question → yes/no/maybe.
`qasc`	HuggingFace	QASC: 8-way MC QA with science facts. 8134 train examples.
`qnli`	HuggingFace	Question-answering NLI; whether context sentence contains answer to question
`qqp`	HuggingFace	Quora Question Pairs; whether two questions are semantically equivalent
`race`	HuggingFace	RACE: large-scale reading comprehension from Chinese English exams. 87K train examples. 4-way MC.
`reuters_cmu`	Download	Reuters-21578 is a well-known newswire dataset containing 21,578 documents.
`reuters_r8`	Kaggle	Reuters R8 subset of Reuters 21578 dataset from Kaggle.
`reward_bench`	HuggingFace	RewardBench: preference evaluation - predict which model response is chosen/preferred.
`rotten_tomatoes`	HuggingFace	Rotten Tomatoes movie review sentiment; positive/negative
`rte`	HuggingFace	Recognizing Textual Entailment; entailment vs not-entailment
`scienceqa_vqa`	HuggingFace	ScienceQA: science question + optional lecture → multiple choice answer.
`scotus_classification`	HuggingFace	LexGLUE SCOTUS; US Supreme Court opinion issue area classification
`setfit_ag_news`	HuggingFace	SetFit AG News: 4-class news topic classification (World, Sports, Business, Sci/Tech).
`setfit_emotion`	HuggingFace	SetFit Emotion: 6-class emotion classification from English Twitter messages.
`setfit_yelp_review`	HuggingFace	SetFit Yelp Review Full: 5-class star rating classification from Yelp reviews.
`sib200`	HuggingFace	SIB-200: multilingual 7-class topic classification in 205 languages. 143K train examples.
`sms_spam`	HuggingFace	SMS spam detection: ham or spam binary classification. 5574 examples.
`snli`	HuggingFace	Stanford NLI; premise/hypothesis pairs → entailment/neutral/contradiction
`spotify_tracks`	HuggingFace	Spotify tracks: predict genre from audio features. 114K examples.
`sst2_hf`	HuggingFace	Stanford Sentiment Treebank 2-class (HF canonical version)
`sst3`	HuggingFace	Three-class sentiment dataset (negative/neutral/positive).
`sst5`	HuggingFace	The SST5 dataset (Stanford Sentiment Treebank, 5-class).
`sst5_setfit`	HuggingFace	SST-5 fine-grained sentiment (SetFit version); 5 classes
`superglue_rte`	HuggingFace	SuperGLUE version of Recognizing Textual Entailment
`synthia`	HuggingFace	SYNTHIA: synthetic instructional examples across 20 academic fields. 1.2M examples.
`tweet_eval_emoji`	HuggingFace	TweetEval emoji: predict emoji from tweet text. 20-class classification.
`tweet_sentiment_extraction`	HuggingFace	Tweet sentiment extraction; positive/negative/neutral
`tweeteval_emotion`	HuggingFace	TweetEval emotion classification; 4 classes
`tweeteval_hate`	HuggingFace	TweetEval hate speech detection
`tweeteval_irony`	HuggingFace	TweetEval irony detection
`tweeteval_offensive`	HuggingFace	TweetEval offensive language detection
`tweeteval_sentiment`	HuggingFace	TweetEval sentiment; positive/negative/neutral tweet classification
`tweeteval_stance`	HuggingFace	TweetEval stance detection; against/in favor/neutral
`twitter_financial_news_topic`	HuggingFace	Twitter financial news 20-class topic classification. 17K train examples.
`wic`	HuggingFace	Word-in-Context; word sense disambiguation as binary classification
`wiki_qa`	HuggingFace	WikiQA: answer sentence selection (relevant/not-relevant). 20K train examples.
`wildchat`	HuggingFace	WildChat: 570K real ChatGPT user interactions; language + toxicity classification.
`winograd_schema`	HuggingFace	Winograd Schema Challenge; pronoun coreference resolution
`winogrande_hf`	HuggingFace	WinoGrande (allenai): large-scale commonsense reasoning benchmark (273K examples).
`wnli`	HuggingFace	Winograd NLI; pronoun reference coreference classification
`xnli`	HuggingFace	XNLI cross-lingual NLI. Premise+hypothesis → entailment/neutral/contradiction.
`xnli_de`	HuggingFace	XNLI German: cross-lingual natural language inference, German split.
`xnli_en`	HuggingFace	XNLI English: cross-lingual natural language inference, English split.
`xnli_es`	HuggingFace	XNLI Spanish: cross-lingual natural language inference, Spanish split.
`xnli_fr`	HuggingFace	XNLI French: cross-lingual natural language inference, French split.
`xnli_zh`	HuggingFace	XNLI Chinese: cross-lingual natural language inference, Chinese split.
`yahoo_answers`	Download	The Yahoo Answers dataset
`yahoo_answers_topics`	HuggingFace	Yahoo Answers 10-class topic classification. 1.4M train examples.
`yelp_polarity`	HuggingFace	Yelp binary sentiment classification: 1 (negative) or 2 (positive). 560K train examples.
`yelp_review_full`	HuggingFace	Yelp review 5-star rating classification
`yelp_reviews`	Download	The Yelp Reviews dataset

Text Generation / Summarization / Translation (82 datasets)¶

Dataset	Source	Description
`aeslc`	HuggingFace	AESLC: annotated email subject line corpus for email body → subject summarization.
`alpaca`	Download	Stanford Alpaca instruction-tuning dataset (https://github.com/tatsu-lab/stanford_alpaca) for LLM fi
`alpaca_cleaned`	HuggingFace	Alpaca Cleaned: 52K instruction-following samples, cleaned version of Stanford Alpaca.
`alpaca_gpt4`	HuggingFace	Alpaca GPT-4: 52K instruction-following examples.
`alpaca_gpt4_zh`	HuggingFace	Alpaca GPT-4 Chinese: 42K Chinese instruction-following examples.
`ambig_qa`	HuggingFace	AmbigQA: open-domain QA with ambiguous questions. Text-in text-out task.
`arxiv_abstracts_2021`	HuggingFace	ArXiv abstracts 2021: predict abstract from title. 2M train examples.
`arxiv_summarization`	HuggingFace	ArXiv document summarization: predict abstract from full paper text. 203K train examples.
`bbh`	HuggingFace	Big-Bench Hard boolean expressions. Input → target.
`big_patent`	HuggingFace	BigPatent: patent claim summarization (category A). 1.2M train examples.
`bigbench`	HuggingFace	BigBench: abstract narrative understanding task. 2400 train examples.
`billsum`	HuggingFace	BillSum; US congressional and California bill summarization
`bornholm_bitext`	HuggingFace	Bornholm Bitext Mining: Danish-Bornholmsk low-resource translation. 5785 examples.
`cmrc2018`	HuggingFace	CMRC 2018: Chinese machine reading comprehension (SQuAD-style). 10K train examples.
`cnn_dailymail`	HuggingFace	CNN/DailyMail news summarization. Article → highlights. ~300K examples.
`cnn_dm_hf`	HuggingFace	CNN/DailyMail (abisee): news article summarization, 3.0.0 version (287K examples).
`code_alpaca`	Download	This dataset, created by sahil280114, aims to build and share an instruction-following LLaMA model f
`code_search_net`	HuggingFace	CodeSearchNet Python: function code → docstring (text generation).
`codex_thinking`	HuggingFace	CodeX 2M: code generation with chain-of-thought reasoning. 2.2M examples.
`codexglue_code_to_text`	HuggingFace	CodeXGlue Python code → docstring generation.
`consumer_complaints`	Kaggle	The dataset contains different information of complaints that customers have made about a multiple p
`dialogsum`	HuggingFace	DialogSum dialogue summarization. Dialogue → summary. ~13K examples.
`drop`	HuggingFace	DROP: Discrete Reasoning Over Paragraphs reading comprehension. 77K train examples.
`duorc`	HuggingFace	DuoRC SelfRC: movie plot + question → answer text.
`europarl_bg_cs`	HuggingFace	Europarl: European Parliament proceedings Bulgarian-Czech translation. 400K train pairs.
`europarl_bg_en`	HuggingFace	Europarl: European Parliament proceedings Bulgarian-English translation.
`europarl_cs_en`	HuggingFace	Europarl: European Parliament proceedings Czech-English translation.
`europarl_da_en`	HuggingFace	Europarl: European Parliament proceedings Danish-English translation.
`europarl_de_en`	HuggingFace	Europarl: European Parliament proceedings German-English translation.
`europarl_el_en`	HuggingFace	Europarl: European Parliament proceedings Greek-English translation.
`europarl_en_es`	HuggingFace	Europarl: European Parliament proceedings English-Spanish translation.
`europarl_en_fr`	HuggingFace	Europarl: European Parliament proceedings English-French translation.
`europarl_en_it`	HuggingFace	Europarl: European Parliament proceedings English-Italian translation.
`europarl_en_nl`	HuggingFace	Europarl: European Parliament proceedings English-Dutch translation.
`europarl_en_pl`	HuggingFace	Europarl: European Parliament proceedings English-Polish translation.
`europarl_en_pt`	HuggingFace	Europarl: European Parliament proceedings English-Portuguese translation.
`europarl_en_ro`	HuggingFace	Europarl: European Parliament proceedings English-Romanian translation.
`europarl_en_sv`	HuggingFace	Europarl: European Parliament proceedings English-Swedish translation.
`flashrag_2wikimultihop`	HuggingFace	FlashRAG 2WikiMultiHopQA: multi-hop QA. 15K train examples.
`gaiasky_qa`	HuggingFace	Gaiasky astronomy Q&A dataset. 3.8K examples.
`govreport_summarization`	HuggingFace	GovReport: long government report summarization. 17K train examples.
`gsm8k`	HuggingFace	GSM8K; grade school math word problems with step-by-step solutions
`gsm8k_openai`	HuggingFace	GSM8K (OpenAI): grade school math word problems (8.5K problems).
`hh_rlhf`	HuggingFace	Anthropic HH-RLHF: 170K human feedback pairs for helpful/harmless RLHF training.
`hotpot_qa`	HuggingFace	HotpotQA multi-hop reasoning QA. Question → answer.
`kilt_nq`	HuggingFace	KILT NQ: Natural Questions in the KILT knowledge-intensive framework. 87K train examples.
`language_identification`	HuggingFace	Language identification; 20 languages from Twitter data
`math500`	HuggingFace	MATH-500 test subset; competition math with step-by-step solutions
`mathvista`	HuggingFace	MathVista: math reasoning over images. Question → answer.
`mbpp`	HuggingFace	MBPP Python problem description → solution code generation. 500 problems.
`medical_flashcards`	HuggingFace	Medical flashcards: Q&A for medical topics. 34K examples.
`multi30k`	HuggingFace	Multi30k: English-German image caption translation. 29K train pairs.
`multiun_ar_en`	HuggingFace	MultiUN Arabic-English United Nations document translation. 9.8M train pairs.
`natural_questions`	HuggingFace	Natural Questions: real Google search questions with Wikipedia answers.
`natural_questions_hard_negatives`	HuggingFace	Natural Questions hard negatives for retrieval/ranking
`naver_news_summary`	HuggingFace	Naver News Korean summarization dataset
`opus100_en_es`	HuggingFace	OPUS-100 English-Spanish parallel corpus. ~1M sentence pairs.
`opus100_en_fr`	HuggingFace	OPUS-100 English-French parallel corpus. ~1M sentence pairs.
`opus_books_en_fr`	HuggingFace	OPUS Books English-French literary translations.
`orca_dpo_pairs`	HuggingFace	Intel Orca DPO Pairs: preference dataset for direct preference optimization.
`orca_math`	HuggingFace	OrcaMath: 200K math word problems with solutions.
`phinc`	HuggingFace	PHINC: Hindi-English parallel corpus. 13K train pairs.
`pubmed_summarization`	HuggingFace	PubMed biomedical document summarization. 120K train examples.
`python_code_instructions`	HuggingFace	Python code generation from instructions. 18K examples.
`samsum`	HuggingFace	SAMSum: dialogue summarization. 14K train examples.
`sciq`	HuggingFace	SciQ science QA with support text. Support + question → correct_answer (text).
`scitail`	HuggingFace	SciTail; science textual entailment from multiple-choice questions
`setimes_bg_bs`	HuggingFace	SETimes: South-East European Times Bulgarian-Bosnian translation. 136K train pairs.
`squad`	HuggingFace	SQuAD v1.1 extractive QA. Context + question → answer text. 87K examples.
`squad_v2`	HuggingFace	SQuAD v2 extractive QA with unanswerable questions. 130K examples.
`tofu`	HuggingFace	TOFU: Fictitious Unlearning. Fictional author QA dataset for LLM unlearning research. 4K train examp
`trivia_qa`	HuggingFace	TriviaQA reading comprehension. Question → answer.
`truthful_qa`	HuggingFace	TruthfulQA: question → best truthful answer. 817 questions.
`vukuzenzele`	HuggingFace	Vukuzenzele: Afrikaans-English sentence pairs. 2.7K train examples.
`web_questions`	HuggingFace	WebQuestions: Freebase-grounded factoid QA. 3.8K train examples.
`winogrande`	HuggingFace	WinoGrande; large-scale Winograd schema challenge
`wmt14_de_en`	HuggingFace	WMT14 German-English news translation. ~4.5M sentence pairs.
`wmt16_de_en`	HuggingFace	WMT16 German-English news translation.
`wmt19_de_en`	HuggingFace	WMT19 German-English news translation.
`wmt_t2t_de_en`	HuggingFace	WMT T2T German-English news translation. 4.6M train sentence pairs.
`xsum`	HuggingFace	XSum BBC news summarization. Document → one-sentence summary. ~200K examples.
`xsum_hf`	HuggingFace	XSum (Edinburgh NLP): extreme summarization of BBC news articles (226K examples).

Text Regression (91 datasets)¶

Dataset	Source	Description
`adult_income_hf`	HuggingFace	Adult/Census Income dataset for income >50K classification
`ae_price_prediction`	Download	Innerwear Data from Victoria's Secret and Others
`allocine`	HuggingFace	AlloCine: French movie review sentiment dataset. Binary positive/negative labels.
`amazon_review_polarity`	Download	The Amazon Reviews Polarity dataset
`amazon_reviews_2023`	HuggingFace	Amazon Reviews 2023: predict star rating from review title and text. 571M train examples.
`app_reviews`	HuggingFace	App Reviews: 288K mobile app reviews with 1-5 star ratings.
`beavertails`	HuggingFace	BeaverTails LLM safety: prompt+response → is_safe binary + category.
`blimp`	HuggingFace	BLiMP: Benchmark of Linguistic Minimal Pairs. Binary grammaticality
`bookprice_prediction`	Download	Here we explore a database of books of different genres, from thousands of authors.
`boolq_standalone`	HuggingFace	BoolQ standalone; naturally occurring yes/no questions with passage
`california_house_price`	Download	Predict house sale prices based on the house information, such as # of bedrooms,
`civil_comments`	HuggingFace	Civil Comments toxicity classification; multi-attribute toxicity labels
`code_defect_detection`	HuggingFace	CodeXGLUE: binary defect detection in code functions. 21K train examples.
`factcheck`	HuggingFace	FactCheck: multilingual fact-checking question ranking dataset. 2M train examples.
`fake_job_postings2`	Download	This dataset contains 18K job descriptions out of which about 800 are fake.
`fineweb_edu`	HuggingFace	FineWeb-Edu: 1.3T token high-quality educational web text; 10BT sample subset.
`germeval18`	HuggingFace	GermEval 2018: German offensive language detection. Binary and multi-class labels.
`google_qa_answer_type_reason_explanation`	Download	Google QUEST Q&A Labeling
`google_qa_question_type_reason_explanation`	Download	Google QUEST Q&A Labeling
`hc3`	HuggingFace	Human ChatGPT Comparison Corpus (HC3): binary classification of whether an
`hc3_chinese`	HuggingFace	HC3-Chinese: Chinese Human ChatGPT Comparison Corpus. Binary classification
`helpsteer2`	HuggingFace	HelpSteer2: 21K prompt-response pairs with 5 quality attributes rated 0-4.
`imdb`	Kaggle	IMDB dataset of 50K movie reviews for natural language processing or text analytics.
`imdb_genre_prediction`	Download	A data set of 1,000 most popular movies on IMDB in the last 10 years. The data points included are:
`imdb_mteb`	HuggingFace	IMDB movie review sentiment binary classification (positive/negative). 24K train examples.
`irony`	Download	The Reddit Irony dataset.
`jc_penney_products`	Download	JCPenney products
`jigsaw_unintended_bias`	Download	A dataset labeled for identity mentions and optimizing a metric designed to measure unintended bias.
`jigsaw_unintended_bias100k`	Download	A dataset labeled for identity mentions and optimizing a metric designed to measure unintended bias.
`kick_starter_funding`	Download	Funding Successful Projects on Kickstarter
`klue_sts`	HuggingFace	KLUE STS Korean sentence similarity. Sentence pair → score (0-5 regression).
`lmsys_arena`	HuggingFace	LMSYS Arena: 55K human preference comparisons between LLM responses.
`measuring_hate_speech`	HuggingFace	Measuring Hate Speech; continuous hate speech score regression
`mercari_price_suggestion`	Download	Predict product price based on details like product category name, brand name, and item condition.
`mercari_price_suggestion100K`	Download	Predict product price based on details like product category name, brand name, and item condition.
`moral_stories`	HuggingFace	Moral Stories: binary classification of moral vs immoral actions given a norm,
`msmarco_passage`	HuggingFace	MS MARCO passage retrieval; query-passage relevance scoring
`mteb_amazon_polarity`	HuggingFace	MTEB Amazon Polarity: binary positive/negative sentiment from Amazon reviews (4M reviews).
`mteb_biosses`	HuggingFace	MTEB BIOSSES: biomedical sentence similarity benchmark (100 sentence pairs).
`mteb_imdb`	HuggingFace	MTEB IMDB: binary movie review sentiment classification.
`mteb_sts17_ar`	HuggingFace	MTEB STS17 (Arabic-Arabic): Arabic semantic textual similarity.
`mteb_sts17_de`	HuggingFace	MTEB STS17 (German-English): cross-lingual semantic textual similarity.
`mteb_sts17_en`	HuggingFace	MTEB STS17 (English-English): semantic textual similarity regression.
`mteb_sts17_es`	HuggingFace	MTEB STS17 (Spanish-English): cross-lingual semantic textual similarity.
`mteb_sts17_fr`	HuggingFace	MTEB STS17 (French-English): cross-lingual semantic textual similarity.
`mteb_sts22_ar`	HuggingFace	MTEB STS22 (Arabic): semantic textual similarity regression.
`mteb_sts22_de`	HuggingFace	MTEB STS22 (German): semantic textual similarity regression.
`mteb_sts22_de_en`	HuggingFace	MTEB STS22 (German-English cross-lingual): semantic textual similarity.
`mteb_sts22_de_fr`	HuggingFace	MTEB STS22 (German-French cross-lingual): semantic textual similarity.
`mteb_sts22_en`	HuggingFace	MTEB STS22 (English): semantic textual similarity regression.
`mteb_sts22_es`	HuggingFace	MTEB STS22 (Spanish): semantic textual similarity regression.
`mteb_sts22_es_en`	HuggingFace	MTEB STS22 (Spanish-English cross-lingual): semantic textual similarity.
`mteb_sts22_es_it`	HuggingFace	MTEB STS22 (Spanish-Italian cross-lingual): semantic textual similarity.
`mteb_sts22_fr`	HuggingFace	MTEB STS22 (French): semantic textual similarity regression.
`mteb_sts22_it`	HuggingFace	MTEB STS22 (Italian): semantic textual similarity regression.
`mteb_sts22_pl`	HuggingFace	MTEB STS22 (Polish): semantic textual similarity regression.
`mteb_sts22_pl_en`	HuggingFace	MTEB STS22 (Polish-English cross-lingual): semantic textual similarity.
`mteb_sts22_ru`	HuggingFace	MTEB STS22 (Russian): semantic textual similarity regression.
`mteb_sts22_tr`	HuggingFace	MTEB STS22 (Turkish): semantic textual similarity regression.
`mteb_sts22_zh`	HuggingFace	MTEB STS22 (Chinese): semantic textual similarity regression.
`mteb_sts22_zh_en`	HuggingFace	MTEB STS22 (Chinese-English cross-lingual): semantic textual similarity.
`mteb_stsbenchmark`	HuggingFace	MTEB STSBenchmark: STS Benchmark semantic similarity (8K sentence pairs, scores 0-5).
`mteb_toxic_convo`	HuggingFace	MTEB Toxic Conversations 50K: binary toxicity classification.
`multirc`	HuggingFace	SuperGLUE MultiRC: paragraph + question + answer → binary (answer correct?).
`news_popularity2`	Download	Online News Popularity Data Set
`persuasion`	HuggingFace	Anthropic Persuasion: rate how persuasive arguments are for various claims.
`sarcastic_headlines`	Kaggle	A dataset to determine if a news headline is sarcastic or serious.
`scandisent`	HuggingFace	ScandiSent: Nordic language sentiment classification. 50K train examples.
`setfit_amazon_polarity`	HuggingFace	SetFit Amazon Polarity: binary positive/negative sentiment from Amazon reviews.
`setfit_mrpc`	HuggingFace	SetFit MRPC: Microsoft Research Paraphrase Corpus, binary paraphrase detection.
`setfit_sst2`	HuggingFace	SetFit SST-2: Stanford Sentiment Treebank binary sentiment (SetFit format).
`setfit_subj`	HuggingFace	SetFit SUBJ: subjectivity detection (subjective vs objective sentences).
`sickr`	HuggingFace	SICK-R: sentences involving compositional knowledge. 9927 test examples.
`sst2`	HuggingFace	The SST2 dataset (Stanford Sentiment Treebank, binary).
`stackoverflow_posts`	HuggingFace	Stack Overflow posts: predict question score from title and body. 58M train examples.
`sts12`	HuggingFace	STS 2012: semantic textual similarity benchmark. 2234 train examples.
`sts13`	HuggingFace	STS 2013: semantic textual similarity. 1500 test examples.
`sts14`	HuggingFace	STS 2014: semantic textual similarity. 3750 test examples.
`sts15`	HuggingFace	STS 2015: semantic textual similarity. 3000 test examples.
`sts16`	HuggingFace	STS 2016: semantic textual similarity. 1186 test examples.
`sts17`	HuggingFace	STS 2017: cross-lingual STS. 5346 test examples.
`sts22`	HuggingFace	STS 2022: cross-lingual semantic textual similarity. 4.6K train examples.
`sts_benchmark`	HuggingFace	STS Benchmark; sentence pair semantic similarity scoring
`stsb`	HuggingFace	Semantic Textual Similarity Benchmark; similarity score 0-5
`stsb_de`	HuggingFace	STS Benchmark German (machine translated). 5749 train examples.
`stsb_sentencetransformers`	HuggingFace	STS Benchmark: semantic textual similarity scoring (0-5 scale). 5.7K train examples.
`student_performance`	HuggingFace	Student performance regression; predict final grade from demographics
`toxic_chat`	HuggingFace	LMSys ToxicChat: human toxicity annotation for LLM conversations.
`wine_reviews`	Download	Wine Reviews
`women_clothing_review`	Download	Women's E-Commerce Clothing Reviews
`yelp_review_polarity`	Download	The Yelp Polarity dataset

Sequence Tagging / NER (12 datasets)¶

Dataset	Source	Description
`acronym_identification`	HuggingFace	Acronym identification: tokens → B-long/I-long/B-short/I-short/O tags.
`audioset_balanced`	HuggingFace	AudioSet balanced: audio event classification with 527 sound classes from YouTube. 18K train example
`few_nerd`	HuggingFace	Few-NERD fine-grained NER with 8 coarse and 66 fine entity types.
`multinerd`	HuggingFace	MultiNERD multilingual NER. 31 fine-grained entity types. CC BY 4.0.
`nq_open`	HuggingFace	Natural Questions Open: open-domain QA with Wikipedia answers. 88K train examples.
`pii_masking`	HuggingFace	PII masking: source text → BIO labels for PII tokens.
`universal_dependencies`	HuggingFace	Universal Dependencies English EWT: POS tagging with UPOS tags. 12K train sentences.
`wikiann`	HuggingFace	WikiANN (PAN-X) Named Entity Recognition — English
`wikiann_de`	HuggingFace	WikiANN German NER. IOB2 tags: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC.
`wikiann_en`	HuggingFace	WikiANN English NER. IOB2 tags: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC.
`wikiann_zh`	HuggingFace	WikiANN Chinese NER. IOB2 tags: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC.
`winobias`	HuggingFace	WinoBias coreference gender bias. Sentence → label.

Multi-label Classification (4 datasets)¶

Dataset	Source	Description
`go_emotions`	HuggingFace	GoEmotions: Multi-label Emotion Classification
`go_emotions_multiclass`	HuggingFace	GoEmotions multi-label emotion (28 classes). Set output. Same as go_emotions but listed separately.
`lex_glue_ecthr`	HuggingFace	LexGLUE ECtHR case text → violated ECHR articles (multi-label). Set output.
`lex_glue_eurlex`	HuggingFace	LexGLUE EuroLex EU documents → EuroVoc concept labels (multi-label). Set output.

Image Classification (29 datasets)¶

Dataset	Source	Description
`ai_generated_ecommerce`	HuggingFace	AI-generated vs real e-commerce product images. 6K examples.
`beans`	HuggingFace	Beans leaf disease classification; 3 classes: angular leaf spot, bean rust, healthy
`cifar10`	HuggingFace	CIFAR-10; 10-class 32x32 image classification
`cifar100`	HuggingFace	CIFAR-100; 100-class 32x32 image classification
`eurosat`	HuggingFace	EuroSAT; land use/cover classification from satellite imagery; 10 classes
`eurosat_rgb`	HuggingFace	EuroSAT RGB: Sentinel-2 satellite image classification (10 land use classes). 16K train examples.
`fashion_mnist`	HuggingFace	Fashion MNIST; 10-class clothing item classification
`food101`	HuggingFace	Food-101; 101-class food image classification
`graid_bdd`	HuggingFace	GrAID-BDD: grounded autonomous driving image QA. 4.6M train examples.
`gtsrb`	HuggingFace	GTSRB; German Traffic Sign Recognition Benchmark; 43 classes
`handwritten_crossouts`	HuggingFace	Handwritten cross-outs: classify handwriting correction styles. 22K examples.
`imagenet_100`	HuggingFace	ImageNet-100: 100-class subset of ImageNet. 126K train examples.
`intuitive_physics`	HuggingFace	Intuitive Physics: image-based physical intuition classification. 280K train examples.
`kvasir_vqa`	HuggingFace	Kvasir-VQA: gastrointestinal endoscopy visual QA. 143K train examples.
`map_trace`	HuggingFace	MapTrace: map-type image classification (7 categories). 20K examples.
`mini_imagenet`	HuggingFace	Mini-ImageNet: 100-class image classification subset of ImageNet. 50K train examples.
`mnist`	Download	The MNIST database of handwritten digits, available from this page,
`mnist_ylecun`	HuggingFace	MNIST handwritten digit recognition. 60K train examples.
`newyorker_caption_contest`	HuggingFace	New Yorker Caption Contest — Multimodal Image+Text Classification
`openfake`	HuggingFace	OpenFake: AI-generated vs real image binary classification.
`oxford_pets`	HuggingFace	Oxford-IIIT Pet; 37 pet breed classification
`path_vqa`	HuggingFace	PathVQA: pathology visual QA (yes/no + open). 20K train examples.
`rendered_sst2`	HuggingFace	Rendered SST2; sentiment classification from rendered text images
`stanford_cars`	HuggingFace	Stanford Cars; 196-class fine-grained car make/model/year classification
`sun397`	HuggingFace	SUN397; scene understanding; 397 scene categories
`svhn`	HuggingFace	SVHN; Street View House Numbers digit classification
`tiny_imagenet`	HuggingFace	Tiny ImageNet; 200-class 64x64 subset of ImageNet
`tobacco_document`	HuggingFace	Tobacco document image classification with OCR text. 2.2K examples.
`wikiart`	HuggingFace	WikiArt; artwork style classification across 27 art styles
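
Image datasets like the ones above depend on media files, which is the case where exporting the processed dataset is recommended. The following is a minimal sketch, assuming Ludwig is installed and that `get_dataset` and `.export(output_directory)` behave as described at the top of this page. The helper names `ludwig_cache_dir` and `export_dataset` are hypothetical, and the cache lookup only mirrors the documented default (`LUDWIG_CACHE`, falling back to `$HOME/.ludwig_cache`); Ludwig's own resolution logic may differ.

```python
import os


def ludwig_cache_dir() -> str:
    """Return the cache directory as documented: LUDWIG_CACHE if set,
    otherwise $HOME/.ludwig_cache. (Hypothetical helper, not a Ludwig API.)"""
    default = os.path.join(os.path.expanduser("~"), ".ludwig_cache")
    return os.environ.get("LUDWIG_CACHE", default)


def export_dataset(name: str, output_directory: str) -> None:
    """Export a processed zoo dataset, including the files it depends on.

    Ludwig is imported lazily so the cache helper works without it installed.
    """
    import ludwig.datasets

    dataset = ludwig.datasets.get_dataset(name)
    # Documented on this page: exports the processed dataset and its media files.
    dataset.export(output_directory)


if __name__ == "__main__":
    print("Cache directory:", ludwig_cache_dir())
```

With Ludwig installed, a call such as `export_dataset("beans", "./beans_export")` would place the images next to the training process, so that relative file paths resolve from its working directory.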

Document Understanding / VQA (8 datasets)¶

Dataset	Source	Description
`ai2d_diagrams`	HuggingFace	AI2 Diagrams VQA: diagram image + question → answer.
`cord_v2`	HuggingFace	CORD v2 receipt understanding: receipt image → JSON ground truth (text generation).
`docvqa`	HuggingFace	DocVQA: document image + question → answer from the document.
`invoice_data`	HuggingFace	Invoice document understanding: invoice image → JSON ground truth (text generation).
`merit`	HuggingFace	MERIT: historical document image recognition. 7K train examples.
`textvqa`	HuggingFace	TextVQA: image with text + question → answer requiring reading the text.
`vqa_rad`	HuggingFace	VQA-RAD; radiology visual question answering
`vqav2`	HuggingFace	VQAv2: image + question → free-form answer.

Semantic Segmentation (1 dataset)¶

Dataset	Source	Description
`camseq`	Kaggle	CamSeq01 Cambridge Labeled Objects in Video

Audio Classification (8 datasets)¶

Dataset	Source	Description
`abjad_kids`	HuggingFace	Abjad-Kids: Arabic letter audio classification (28 classes). 40K examples.
`emodb`	HuggingFace	EMo-DB; Berlin Database of Emotional Speech; 7 emotion classes
`esc50`	Download	ESC-50: Environmental Sound Classification
`minds14`	HuggingFace	MINDS-14: multilingual banking intent classification from audio. 8K train examples.
`mmsulab`	HuggingFace	MMSuLab: multilingual audio speech laboratory data. 1.9M train examples.
`naturelm_audio`	HuggingFace	NatureLM Audio: wildlife/nature audio instruction following. 26M train examples.
`speech_massive`	HuggingFace	Speech-MASSIVE: multilingual audio intent classification. 23K train examples.
`tadabur`	HuggingFace	Tadabur: Arabic Quran audio with reciter, surah, and ayah classification. 409K examples.

Speech Recognition (ASR) (7 datasets)¶

Dataset	Source	Description
`ami_asr`	HuggingFace	AMI corpus: meeting audio transcription. 108K train examples.
`cantonese_asr`	HuggingFace	Cantonese speech recognition dataset
`librispeech`	HuggingFace	LibriSpeech; English speech recognition from audiobooks; clean 100h split
`mls_german`	HuggingFace	Multilingual LibriSpeech German ASR
`peoples_speech`	HuggingFace	People's Speech; 30K-hour diverse English ASR
`ravnursson_asr`	HuggingFace	Ravnursson: Faroese audio speech recognition dataset. 65K train examples.
`voxpopuli`	HuggingFace	VoxPopuli; European Parliament speech ASR; 23 languages

Tabular Classification (12 datasets)¶

Dataset	Source	Description
`bbcnews`	Download	BBC News Classification from Kaggle.
`connect4`	Kaggle	Each row represents the end results of a Connect-4 game.
`diabetes_readmission`	HuggingFace	Diabetes hospital readmission prediction
`forest_cover`	Download	The Forest Cover Type dataset.
`hermes_function_calling`	HuggingFace	Hermes function calling: tool-use conversations with category labels. 1893 train examples.
`iris`	Download	Iris Dataset
`iris_sklearn`	HuggingFace	Iris dataset: classic 3-class flower classification by petal/sepal measurements.
`mushroom_edibility`	Download	This data set includes descriptions of hypothetical samples corresponding
`otto_group_product`	Download	The Otto Group Product Classification Challenge
`poker_hand`	Download	Each record is an example of a hand consisting of five playing cards
`taix_ray`	HuggingFace	TAIX-Ray: thoracic X-ray study - predict patient sex from age and other metadata. 137K train example
`walmart_recruiting`	Download	Walmart Recruiting: Trip Type Classification

Tabular Binary Classification (24 datasets)¶

Dataset	Source	Description
`adult_census_income`	Download	Predict whether income exceeds $50K/yr based on census data
`amazon_employee_access_challenge`	Download	There is a considerable amount of data regarding an employee’s role within an organization and the r
`bnp_claims_management`	Download	The BNP Paribas Cardif Claims Management dataset.
`compas_recidivism`	HuggingFace	COMPAS recidivism risk prediction; criminal justice fairness benchmark
`credit_card_default`	HuggingFace	Credit card default prediction from payment history
`creditcard_fraud`	Kaggle	The Machine Learning Group ULB Dataset
`customer_churn_prediction`	Download	Dataset from a Kaggle competition that is about predicting whether a customer will change
`electricity_tabular`	HuggingFace	Electricity market price direction classification
`heart_failure`	HuggingFace	Heart failure clinical records; death event prediction
`higgs`	Download	The Higgs Boson dataset.
`ieee_fraud`	Download	The IEEE-CIS Fraud Detection Dataset
`imbalanced_insurance`	Kaggle	Health Insurance Cross Sell Prediction
`kdd_appetency`	Download	The KDD Cup 2009 Appetency dataset.
`kdd_churn`	Download	The KDD Cup 2009 Churn dataset.
`kdd_upselling`	Download	The KDD Cup 2009 Upselling dataset.
`noshow_appointments`	Kaggle	110,527 medical appointments and their 14 associated variables (characteristics).
`numerai28pt6`	Kaggle	Encrypted stock market data from the Numerai dataset on Kaggle.
`porto_seguro_safe_driver`	Download	Predict the probability that an auto insurance policy holder files a claim.
`santander_customer_satisfaction`	Download	Santander Customer Satisfaction Prediction.
`santander_customer_transaction`	Download	Santander Customer Transaction Prediction.
`synthetic_fraud`	Kaggle	The Synthetic Financial Datasets For Fraud Detection dataset.
`talkingdata_adtrack_fraud`	Download	TalkingData AdTracking Fraud Detection Challenge.
`telco_customer_churn`	Kaggle	The Telco customer churn data contains information about a fictional telco company
`titanic`	Download	The Titanic dataset: use machine learning to create a model
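
Any name in the Dataset column can be used either as a `ludwig://<dataset>` URI or through the `ludwig.datasets` API. The sketch below shows both for `titanic`, assuming Ludwig is installed and that `get_dataset` and `.load(split=True)` behave as in the examples at the top of this page. `dataset_uri` and `load_train_test` are hypothetical helper names, not part of Ludwig.

```python
def dataset_uri(name: str) -> str:
    """Build the ludwig://<dataset> URI form for a zoo dataset name."""
    return f"ludwig://{name}"


def load_train_test(name: str):
    """Load a zoo dataset as (train_df, test_df) pandas DataFrames.

    Ludwig is imported lazily so the URI helper works without it installed.
    """
    import ludwig.datasets

    dataset = ludwig.datasets.get_dataset(name)
    # load(split=True) returns three dataframes; the third is ignored here,
    # as in the examples at the top of this page.
    train_df, test_df, _ = dataset.load(split=True)
    return train_df, test_df


if __name__ == "__main__":
    # This value can be passed to `ludwig train --dataset ...`.
    print(dataset_uri("titanic"))  # prints ludwig://titanic
```

The URI form suits command-line training, while `load_train_test("titanic")` is more convenient when the dataframes need inspection or custom preprocessing first.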

Tabular Regression (18 datasets)¶

Dataset	Source	Description
`allstate_claims_severity`	Download	Allstate Claims Severity.
`ames_housing`	Download	The Ames Housing dataset.
`california_housing`	Download	California Housing dataset from the 1990 US Census. Predict median house value
`electricity`	Download	Electricity demand dataset. Half-hourly electricity demand in Victoria, Australia during 2014, along
`gaia_cepheids`	HuggingFace	Gaia DR3 Cepheids: predict stellar period from photometry. 15K examples.
`gaia_rrlyrae`	HuggingFace	Gaia DR3 RR Lyrae: predict fundamental period. 271K examples.
`gaia_spectroscopic_binaries`	HuggingFace	Gaia DR3 Spectroscopic Binaries: predict orbital period. 186K examples.
`gaia_young_stellar_objects`	HuggingFace	Gaia DR3 Young Stellar Objects: class prediction from photometry. 79K examples.
`mercedes_benz_greener`	Download	The Mercedes-Benz Greener Manufacturing dataset.
`naval`	Download	Condition Based Maintenance of Naval Propulsion Plants Data Set
`protein`	Download	Physicochemical Properties of Protein Tertiary Structure Data Set.
`repid`	HuggingFace	REPID: Retinal Perceptual Image Dataset — tabular perceptual quality metrics
`rossman_store_sales`	Download	The Rossmann Store Sales dataset.
`santander_value_prediction`	Download	The Santander Value Prediction Challenge dataset.
`sarcos`	Download	The data relates to an inverse dynamics problem for a seven
`stocks_daily_price`	HuggingFace	Daily stock OHLCV: predict close price from symbol and features. 25M examples.
`temperature`	Kaggle	Hourly temperature dataset from Kaggle
`yosemite`	Download	Yosemite temperatures dataset.

Multimodal (7 datasets)¶

Dataset	Source	Description
`flickr8k`	Download	A new benchmark collection for sentence-based image description and search,
`goodbooks_books`	Download	goodbooks_books is a multimodal dataset of 10K books, taken from the goodreads dataset.
`insurance_lite`	Kaggle	The dataset consists of parameters such as the images of damaged cars,
`mobile_mold`	HuggingFace	MobileMold: smartphone food mold binary image classification. 3.5K examples.
`twitter_bots`	Kaggle	A dataset for Twitter Bot account detection.
`wmt15`	Kaggle	French/English parallel texts for training translation models.
`world_speech_asr`	HuggingFace	WorldSpeech: multilingual audio quality with CER measurement. 1.2M train examples.
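
The Source column indicates how each dataset is fetched, and rows marked Kaggle need Kaggle credentials as described earlier on this page. As a rough illustration, the sketch below groups a few rows copied from the Multimodal table by source to list the ones that need credentials. The `ROWS` list and `group_by_source` are hypothetical, written for this example only; with Ludwig installed, the `.is_kaggle_dataset` property is the programmatic check.

```python
from collections import defaultdict

# (name, source) pairs copied by hand from the Multimodal table above.
ROWS = [
    ("flickr8k", "Download"),
    ("insurance_lite", "Kaggle"),
    ("mobile_mold", "HuggingFace"),
    ("twitter_bots", "Kaggle"),
    ("wmt15", "Kaggle"),
]


def group_by_source(rows):
    """Map each source to the dataset names it hosts, preserving row order."""
    grouped = defaultdict(list)
    for name, source in rows:
        grouped[source].append(name)
    return dict(grouped)


by_source = group_by_source(ROWS)
# Datasets in this sample that would require Kaggle credentials.
needs_kaggle_credentials = by_source.get("Kaggle", [])
print(needs_kaggle_credentials)
```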

Special / Utility (1 dataset)¶

Dataset	Source	Description
`hugging_face`	Download	Hugging Face Datasets

Adding datasets¶

To add a dataset to the Ludwig Dataset Zoo, see Add a Dataset.