Dataset Zoo
The Ludwig Dataset Zoo provides datasets that can be directly plugged into a Ludwig model.
The simplest way to use a dataset is to reference it as a URI when specifying the training set:
```shell
ludwig train --dataset ludwig://reuters ...
```
Any Ludwig dataset can be specified as a URI of the form ludwig://<dataset>.
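When scripting training runs, it can help to build the URI and the command line in one place. The sketch below is illustrative only: `dataset_uri` and `train_command` are hypothetical helpers, not part of Ludwig, and the name pattern (letters, digits, underscores) is an assumption based on the dataset names listed on this page.

```python
import re

_URI_PREFIX = "ludwig://"
# Assumption: zoo dataset names consist of letters, digits, and underscores.
_NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")


def dataset_uri(name: str) -> str:
    """Build a ludwig://<dataset> URI from a dataset name."""
    if not _NAME_PATTERN.match(name):
        raise ValueError(f"unexpected dataset name: {name!r}")
    return f"{_URI_PREFIX}{name}"


def train_command(name: str, config_path: str) -> list:
    """Return an argument list for `ludwig train`, suitable for subprocess.run."""
    return ["ludwig", "train", "--dataset", dataset_uri(name), "--config", config_path]
```

The returned list can be passed to `subprocess.run(...)` unchanged, which avoids shell quoting issues.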
Datasets can also be programmatically imported and loaded into a pandas DataFrame using the .load() method:
```python
from ludwig.datasets import reuters

# Loads into single dataframe with a 'split' column:
dataset_df = reuters.load()

# Loads into split dataframes:
train_df, test_df, _ = reuters.load(split=True)
```
The ludwig.datasets API also provides functions to list, describe, and get datasets. For example:
```python
import ludwig.datasets

# Gets a list of all available dataset names.
dataset_names = ludwig.datasets.list_datasets()

# Prints the description of the titanic dataset.
print(ludwig.datasets.describe_dataset("titanic"))

# Gets the titanic dataset by name.
titanic = ludwig.datasets.get_dataset("titanic")

# Loads into single dataframe with a 'split' column:
dataset_df = titanic.load()

# Loads into split dataframes:
train_df, test_df, _ = titanic.load(split=True)
```
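With several hundred datasets in the zoo, a keyword search over names and descriptions is a natural use of the list and describe functions. The following is a minimal sketch: `find_datasets` and `search_zoo` are hypothetical helpers, and the code assumes `describe_dataset` returns a string. The Ludwig import is deferred so the search logic can be tried without Ludwig installed.

```python
from typing import Callable, Iterable, List


def find_datasets(keyword: str, names: Iterable[str], describe: Callable[[str], str]) -> List[str]:
    """Return the dataset names whose name or description contains the keyword."""
    keyword = keyword.lower()
    matches = []
    for name in names:
        # Search the name and the description together, case-insensitively.
        text = f"{name} {describe(name)}".lower()
        if keyword in text:
            matches.append(name)
    return matches


def search_zoo(keyword: str) -> List[str]:
    """Search the Ludwig Dataset Zoo (requires Ludwig to be installed)."""
    import ludwig.datasets  # deferred import

    return find_datasets(
        keyword,
        ludwig.datasets.list_datasets(),
        ludwig.datasets.describe_dataset,
    )
```

For example, `search_zoo("sentiment")` would return the names of datasets that mention sentiment in their name or description.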
Kaggle Datasets
Some datasets are hosted on Kaggle and require a Kaggle account. To use these, you'll need to set up Kaggle credentials in your environment. If the dataset is part of a Kaggle competition, you'll also need to accept the terms on the competition page.
To check programmatically, datasets have an .is_kaggle_dataset property.
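Before loading a Kaggle-hosted dataset, a quick check that credentials are in place can save a failed download. The sketch below assumes the locations the Kaggle API client conventionally reads: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file in `~/.kaggle` (or in `KAGGLE_CONFIG_DIR` if set). This page does not specify them, and `kaggle_credentials_available` is a hypothetical helper, not a Ludwig function.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def kaggle_credentials_available(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> bool:
    """Return True if Kaggle credentials appear to be configured."""
    env = os.environ if env is None else env

    # Option 1: credentials supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json token file in the Kaggle config directory.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR") or (home or Path.home()) / ".kaggle")
    return (config_dir / "kaggle.json").is_file()
```

A script might call this only when a dataset's `.is_kaggle_dataset` property is true, and print setup instructions instead of attempting the download when it returns `False`.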
Downloading, Processing, and Exporting
Datasets are first downloaded into LUDWIG_CACHE, which may be set as an environment variable and defaults to
$HOME/.ludwig_cache.
Datasets are automatically loaded, processed, and re-saved as parquet files in the cache.
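To inspect or clean up cached files, it helps to know which directory is in effect. The sketch below mirrors the rule described above (the `LUDWIG_CACHE` environment variable if set, otherwise `$HOME/.ludwig_cache`). `ludwig_cache_dir` is a hypothetical helper, and Ludwig's own lookup may differ in detail.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def ludwig_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the dataset cache directory from the environment."""
    env = os.environ if env is None else env
    override = env.get("LUDWIG_CACHE")
    if override:
        # An explicit LUDWIG_CACHE setting takes precedence.
        return Path(override).expanduser()
    # Default location described in the documentation.
    return Path.home() / ".ludwig_cache"
```

Calling `ludwig_cache_dir()` with no arguments reads the current process environment.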
To export the processed dataset, including any files it depends on, use the .export(output_directory) method. This
is recommended if the dataset contains media files like images or audio files. File paths are relative to the working
directory of the training process.
```python
from ludwig.datasets import twitter_bots

# Exports twitter bots dataset and image files to the current working directory.
twitter_bots.export(".")
```
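Because media file paths are relative to the working directory of the training process, a missing export usually shows up as file-not-found errors during preprocessing. The following sketch checks a collection of relative paths against a base directory before training. `missing_media_files` is a hypothetical helper, and the column holding the paths depends on the dataset.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def missing_media_files(relative_paths: Iterable[str], working_dir: Optional[Path] = None) -> List[str]:
    """Return the relative paths that do not resolve to a file under working_dir."""
    base = Path.cwd() if working_dir is None else Path(working_dir)
    return [p for p in relative_paths if not (base / p).is_file()]
```

With a loaded DataFrame, the paths could be passed as `dataset_df["image_path"]` (assuming the column is named `image_path`). An empty result means every referenced file is reachable from the working directory.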
End-to-end Example
Here's an end-to-end example of training a model using the MNIST dataset:
```python
from ludwig.api import LudwigModel
from ludwig.datasets import mnist

# Initializes a Ludwig model
config = {
    "input_features": [{"name": "image_path", "type": "image"}],
    "output_features": [{"name": "label", "type": "category"}],
}
model = LudwigModel(config)

# Loads and splits MNIST dataset
training_set, test_set, _ = mnist.load(split=True)

# Exports the mnist image files to the current working directory.
mnist.export(".")

# Runs model training
results = model.train(training_set=training_set, test_set=test_set, model_name="mnist_model")
train_stats = results.train_stats
```
Dataset Splits
All datasets in the dataset zoo are provided with a default train/validation/test split. When loading with
split=False, the default split will be returned (and is guaranteed to be the same every time). With split=True,
Ludwig will randomly re-split the dataset.
Note
Some benchmark or contest datasets are released with held-out test set labels. In other words, the train and validation splits have labels, but the test set does not. Most Kaggle contest datasets have this unlabeled test set.
Splits:
- train: Data to train on. Required, must have labels.
- validation: Subset of dataset to evaluate while training. Optional, must have labels.
- test: Held out from model development, used for later testing. Optional, may not be labeled.
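To see which splits a dataset actually provides, and whether its test split is labeled, the rows can be tallied by split. The sketch below works on plain records (a list of dicts) and assumes the `split` column uses the codes 0 for train, 1 for validation, and 2 for test. That encoding is an assumption not stated on this page, and `summarize_splits` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping

# Assumption: split codes follow the 0/1/2 = train/validation/test convention.
SPLIT_NAMES = {0: "train", 1: "validation", 2: "test"}


def _has_label(value: Any) -> bool:
    """Treat None and NaN (which is not equal to itself) as missing labels."""
    return value is not None and value == value


def summarize_splits(
    rows: Iterable[Mapping[str, Any]],
    label_column: str,
    split_column: str = "split",
) -> Dict[str, Dict[str, int]]:
    """Count rows and labeled rows for each split."""
    summary = defaultdict(lambda: {"rows": 0, "labeled": 0})
    for row in rows:
        name = SPLIT_NAMES[int(row[split_column])]
        summary[name]["rows"] += 1
        if _has_label(row.get(label_column)):
            summary[name]["labeled"] += 1
    return dict(summary)
```

For a DataFrame loaded without splitting, the records can be obtained with `dataset_df.to_dict("records")`. A test split whose `labeled` count is 0 is one of the held-out cases described in the note above.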
Zoo Datasets
Ludwig ships with 590 datasets spanning 15 task categories.
Each dataset can be loaded with ludwig datasets download <name> or from ludwig.datasets import <name>.
Text Classification (286 datasets)
| Dataset | Source | Description |
|---|---|---|
| aegis_safety | HuggingFace | NVIDIA Aegis 2.0: AI content safety classification. 30K train examples. |
| ag_news_hf | HuggingFace | AG News 4-class topic classification (HF version) |
| agnews | Download | News articles categorized as "World", "Sports", "Business", and "Science". |
| amazon_massive_intent | HuggingFace | Amazon MASSIVE multilingual intent classification (60 intents, all 51 languages combined). 587K trai |
| amazon_massive_scenario | HuggingFace | Amazon MASSIVE multilingual scenario classification (18 scenarios, 51 languages). 587K train. |
| amazon_polarity | HuggingFace | Amazon product review polarity; positive/negative |
| amazon_reviews | Download | The Amazon Reviews dataset |
| amazon_science_massive | HuggingFace | Amazon MASSIVE multilingual intent classification dataset (51 languages, 60 intents). |
| anli | HuggingFace | Adversarial NLI; iteratively adversarial NLI benchmark |
| aqua_rat | HuggingFace | AQuA-RAT: algebraic word problems 5-way MC with rationales. 97K train examples. |
| arc_challenge | HuggingFace | AI2 ARC-Challenge harder science QA. 4-way multiple choice. |
| arc_easy | HuggingFace | AI2 ARC-Easy science QA. 4-way multiple choice. A/B/C/D. |
| banking77 | HuggingFace | Banking77; 77-class banking customer intent classification |
| banking77_legacy | HuggingFace | Banking77 (legacy) 77-class banking intent classification. 10K train examples. |
| belebele | HuggingFace | Belebele multilingual reading comprehension; multiple-choice |
| belebele_fr | HuggingFace | Belebele French; multilingual reading comprehension |
| bitext_customer_intent | HuggingFace | Bitext customer support intent detection (26K train, 27 intents) |
| bitext_customer_support | HuggingFace | Bitext customer support: 27 intents, 10 categories. 26K train examples. |
| boolq | HuggingFace | Boolean Questions; reading comprehension yes/no questions from Google |
| brazilian_toxic_tweets | HuggingFace | Brazilian toxic tweets binary classification. 8K train examples. |
| ccnews | HuggingFace | CC-News: multilingual web news articles from Common Crawl. 1.9M train examples. |
| climate_fever | HuggingFace | ClimateFEVER: climate change claim verification dataset. |
| climate_sentiment | HuggingFace | ClimateBERT climate-related text sentiment analysis |
| clinc_oos | HuggingFace | CLINC OOS: 150-class intent classification with out-of-scope detection. 10K train examples. |
| clue_afqmc | HuggingFace | CLUE AFQMC: Chinese short text similarity for financial domain. 34K train examples. |
| code_contests | HuggingFace | DeepMind code contests: competitive programming problems with difficulty classification. 3.7K train |
| coig_cqia | HuggingFace | COIG-CQIA: Chinese instruction dataset with domain/task type. 1111 examples. |
| cola | HuggingFace | Corpus of Linguistic Acceptability; grammatically acceptable or not |
| commitment_bank | HuggingFace | CommitmentBank; textual entailment with 3-way classification |
| commonsense_qa | HuggingFace | CommonsenseQA 5-way multiple choice. Question → A/B/C/D/E. |
| copa | HuggingFace | Choice Of Plausible Alternatives; causal commonsense reasoning |
| customer_reviews | HuggingFace | Customer Reviews; product review sentiment binary classification |
| dair_emotion | HuggingFace | DAIR.AI Emotion: 6-class emotion classification from English Twitter messages. |
| data_scientist_salary | Download | The training data and test data comprise of 19802 samples and of 6601 samples each from the |
| databench_qa | HuggingFace | DataBench: data analysis question-answer type classification. 1830 train examples. |
| databricks_dolly_15k | HuggingFace | Databricks Dolly 15K instruction-response pairs with 8 task categories (QA, summarization, etc.). |
| dbpedia | Download | The DBPedia Ontology dataset. |
| dbpedia_14 | HuggingFace | DBpedia 14 ontology text classification; 14 categories |
| dolly_15k | HuggingFace | Databricks Dolly 15K: 15K instruction-response pairs with task category labels. |
| emotion | HuggingFace | Twitter emotion classification; 6 classes: joy, sadness, anger, fear, surprise, love |
| enron_spam | HuggingFace | Enron Spam; email spam/ham classification |
| fake_news_detection | HuggingFace | Fake news detection; real vs fake news articles |
| farstail_nli | HuggingFace | FarsTail: Persian NLI (entailment/neutral/contradiction). 1K test examples. |
| fever | Download | FEVER: a Large-scale Dataset for Fact Extraction and VERification |
| fever_gold | HuggingFace | FEVER fact verification gold evidence. Claim → SUPPORTS/REFUTES/NOT_ENOUGH_INFO. |
| financial_phrasebank | HuggingFace | Financial classification: sentiment analysis of financial text. |
| flores_101 | HuggingFace | FLoRes-101 English evaluation sentences. Topic classification: predict |
| goemotions | Download | GoEmotions: A Dataset for Fine-Grained Emotion Classification. |
| google_quest_qa | Download | Google QUEST Q&A Labeling |
| hate_speech18 | HuggingFace | Hate Speech 18 (SetFit): binary hate speech detection on online forum data. |
| hatespeech_offensive | HuggingFace | Hate Speech and Offensive Language: 25K tweets with 3-class labels |
| hellaswag | HuggingFace | HellaSwag commonsense NLI. Activity + context → correct ending. A/B/C/D. |
| imdb_sentiment | HuggingFace | IMDB movie review sentiment; positive/negative binary classification |
| indic_glue | HuggingFace | IndicGLUE Telugu sentiment classification (actsa-sc). 4328 train examples. |
| klue_topic | HuggingFace | KLUE YNAT; Korean news topic classification; 7 categories |
| kmmlu | HuggingFace | KMMLU: Korean Massive Multitask Language Understanding, 4-way MC. 45 train examples per subset. |
| m3cot | HuggingFace | M3CoT: multimodal multi-domain QA category classification. 7K examples. |
| m_mmlu | HuggingFace | Multilingual MMLU (Arabic): 4-way multiple choice academic questions. 274 train examples. |
| medmcqa | HuggingFace | Medical entrance exam QA; 4-choice medical questions |
| melbourne_airbnb | Download | Melbourne Airbnb Open Data |
| mmlu | HuggingFace | MMLU massive multitask benchmark. 57 tasks, 4-way multiple choice. |
| mmlu_lighteval | HuggingFace | MMLU: Massive Multitask Language Understanding, 57 academic subjects, 4-way MC. 99K auxiliary train. |
| mmlu_pro | HuggingFace | MMLU-Pro harder multitask benchmark. 10-way multiple choice. |
| mnli | HuggingFace | Multi-Genre Natural Language Inference; premise + hypothesis -> entailment/neutral/contradiction |
| mrpc | HuggingFace | Microsoft Research Paraphrase Corpus; paraphrase detection |
| mteb_amazon_reviews_class_de | HuggingFace | MTEB Amazon Reviews Classification (German): 5-class star rating classification. |
| mteb_amazon_reviews_class_en | HuggingFace | MTEB Amazon Reviews Classification (English): 5-class star rating classification. |
| mteb_amazon_reviews_class_es | HuggingFace | MTEB Amazon Reviews Classification (Spanish): 5-class star rating classification. |
| mteb_amazon_reviews_class_fr | HuggingFace | MTEB Amazon Reviews Classification (French): 5-class star rating classification. |
| mteb_amazon_reviews_class_ja | HuggingFace | MTEB Amazon Reviews Classification (Japanese): 5-class star rating classification. |
| mteb_amazon_reviews_class_zh | HuggingFace | MTEB Amazon Reviews Classification (Chinese): 5-class star rating classification. |
| mteb_cyrillic_turkic | HuggingFace | MTEB Cyrillic Turkic Language Classification: language identification for Cyrillic-script Turkic lan |
| mteb_emotion | HuggingFace | MTEB emotion task; 6-class emotion from tweets |
| mteb_financial_phrasebank | HuggingFace | MTEB Financial Phrasebank Classification: financial news sentiment classification. |
| mteb_frenk_en | HuggingFace | MTEB Frenk English Classification: hate speech detection in English. |
| mteb_frenk_hr | HuggingFace | MTEB Frenk Croatian Classification: hate speech detection in Croatian. |
| mteb_frenk_sl | HuggingFace | MTEB Frenk Slovenian Classification: hate speech detection in Slovenian. |
| mteb_georeview | HuggingFace | MTEB Georeview Classification: Russian-language geographic review sentiment classification. |
| mteb_greek_legal | HuggingFace | MTEB Greek Legal Code Classification: Greek legal code topic classification. |
| mteb_ita_casehold | HuggingFace | MTEB ItaCasehold Classification: Italian legal case holding classification. |
| mteb_jd_review | HuggingFace | MTEB JDReview: Chinese product review sentiment classification from JD.com. |
| mteb_kor_sarcasm | HuggingFace | MTEB Korean Sarcasm Classification: sarcasm detection in Korean social media. |
| mteb_language_class | HuggingFace | MTEB Language Classification: language identification from short text. |
| mteb_massive_intent_af | HuggingFace | MTEB MASSIVE Intent Classification (Afrikaans): task-oriented dialog intent prediction. |
| mteb_massive_intent_am | HuggingFace | MTEB MASSIVE Intent Classification (Amharic): task-oriented dialog intent prediction. |
| mteb_massive_intent_ar | HuggingFace | MTEB MASSIVE Intent (Arabic): 60-class intent classification from voice assistant queries. |
| mteb_massive_intent_az | HuggingFace | MTEB MASSIVE Intent Classification (Azerbaijani): task-oriented dialog intent prediction. |
| mteb_massive_intent_bn | HuggingFace | MTEB MASSIVE Intent Classification (Bengali): task-oriented dialog intent prediction. |
| mteb_massive_intent_cy | HuggingFace | MTEB MASSIVE Intent Classification (Welsh): task-oriented dialog intent prediction. |
| mteb_massive_intent_da | HuggingFace | MTEB MASSIVE Intent Classification (Danish): task-oriented dialog intent prediction. |
| mteb_massive_intent_de | HuggingFace | MTEB MASSIVE Intent (German): 60-class intent classification from voice assistant queries. |
| mteb_massive_intent_el | HuggingFace | MTEB MASSIVE Intent Classification (Greek): task-oriented dialog intent prediction. |
| mteb_massive_intent_en | HuggingFace | MTEB MASSIVE Intent (English): 60-class intent classification from voice assistant queries. |
| mteb_massive_intent_es | HuggingFace | MTEB MASSIVE Intent (Spanish): 60-class intent classification from voice assistant queries. |
| mteb_massive_intent_fa | HuggingFace | MTEB MASSIVE Intent Classification (Farsi): task-oriented dialog intent prediction. |
| mteb_massive_intent_fi | HuggingFace | MTEB MASSIVE Intent Classification (Finnish): task-oriented dialog intent prediction. |
| mteb_massive_intent_fr | HuggingFace | MTEB MASSIVE Intent (French): 60-class intent classification from voice assistant queries. |
| mteb_massive_intent_he | HuggingFace | MTEB MASSIVE Intent Classification (Hebrew): task-oriented dialog intent prediction. |
| mteb_massive_intent_hi | HuggingFace | MTEB MASSIVE Intent Classification (Hindi): task-oriented dialog intent prediction. |
| mteb_massive_intent_hu | HuggingFace | MTEB MASSIVE Intent Classification (Hungarian): task-oriented dialog intent prediction. |
| mteb_massive_intent_hy | HuggingFace | MTEB MASSIVE Intent Classification (Armenian): task-oriented dialog intent prediction. |
| mteb_massive_intent_id | HuggingFace | MTEB MASSIVE Intent Classification (Indonesian): task-oriented dialog intent prediction. |
| mteb_massive_intent_is | HuggingFace | MTEB MASSIVE Intent Classification (Icelandic): task-oriented dialog intent prediction. |
| mteb_massive_intent_it | HuggingFace | MTEB MASSIVE Intent Classification (Italian): task-oriented dialog intent prediction. |
| mteb_massive_intent_ja | HuggingFace | MTEB MASSIVE Intent Classification (Japanese): task-oriented dialog intent prediction. |
| mteb_massive_intent_jv | HuggingFace | MTEB MASSIVE Intent Classification (Javanese): task-oriented dialog intent prediction. |
| mteb_massive_intent_ka | HuggingFace | MTEB MASSIVE Intent Classification (Georgian): task-oriented dialog intent prediction. |
| mteb_massive_intent_km | HuggingFace | MTEB MASSIVE Intent Classification (Khmer): task-oriented dialog intent prediction. |
| mteb_massive_intent_kn | HuggingFace | MTEB MASSIVE Intent Classification (Kannada): task-oriented dialog intent prediction. |
| mteb_massive_intent_ko | HuggingFace | MTEB MASSIVE Intent Classification (Korean): task-oriented dialog intent prediction. |
| mteb_massive_intent_lv | HuggingFace | MTEB MASSIVE Intent Classification (Latvian): task-oriented dialog intent prediction. |
| mteb_massive_intent_ml | HuggingFace | MTEB MASSIVE Intent Classification (Malayalam): task-oriented dialog intent prediction. |
| mteb_massive_intent_mn | HuggingFace | MTEB MASSIVE Intent Classification (Mongolian): task-oriented dialog intent prediction. |
| mteb_massive_intent_ms | HuggingFace | MTEB MASSIVE Intent Classification (Malay): task-oriented dialog intent prediction. |
| mteb_massive_intent_my | HuggingFace | MTEB MASSIVE Intent Classification (Burmese): task-oriented dialog intent prediction. |
| mteb_massive_intent_nb | HuggingFace | MTEB MASSIVE Intent Classification (Norwegian): task-oriented dialog intent prediction. |
| mteb_massive_intent_nl | HuggingFace | MTEB MASSIVE Intent Classification (Dutch): task-oriented dialog intent prediction. |
| mteb_massive_intent_pl | HuggingFace | MTEB MASSIVE Intent Classification (Polish): task-oriented dialog intent prediction. |
| mteb_massive_intent_pt | HuggingFace | MTEB MASSIVE Intent Classification (Portuguese): task-oriented dialog intent prediction. |
| mteb_massive_intent_ro | HuggingFace | MTEB MASSIVE Intent Classification (Romanian): task-oriented dialog intent prediction. |
| mteb_massive_intent_ru | HuggingFace | MTEB MASSIVE Intent Classification (Russian): task-oriented dialog intent prediction. |
| mteb_massive_intent_sl | HuggingFace | MTEB MASSIVE Intent Classification (Slovenian): task-oriented dialog intent prediction. |
| mteb_massive_intent_sq | HuggingFace | MTEB MASSIVE Intent Classification (Albanian): task-oriented dialog intent prediction. |
| mteb_massive_intent_sv | HuggingFace | MTEB MASSIVE Intent Classification (Swedish): task-oriented dialog intent prediction. |
| mteb_massive_intent_sw | HuggingFace | MTEB MASSIVE Intent Classification (Swahili): task-oriented dialog intent prediction. |
| mteb_massive_intent_ta | HuggingFace | MTEB MASSIVE Intent Classification (Tamil): task-oriented dialog intent prediction. |
| mteb_massive_intent_te | HuggingFace | MTEB MASSIVE Intent Classification (Telugu): task-oriented dialog intent prediction. |
| mteb_massive_intent_th | HuggingFace | MTEB MASSIVE Intent Classification (Thai): task-oriented dialog intent prediction. |
| mteb_massive_intent_tl | HuggingFace | MTEB MASSIVE Intent Classification (Tagalog): task-oriented dialog intent prediction. |
| mteb_massive_intent_tr | HuggingFace | MTEB MASSIVE Intent Classification (Turkish): task-oriented dialog intent prediction. |
| mteb_massive_intent_ur | HuggingFace | MTEB MASSIVE Intent Classification (Urdu): task-oriented dialog intent prediction. |
| mteb_massive_intent_vi | HuggingFace | MTEB MASSIVE Intent Classification (Vietnamese): task-oriented dialog intent prediction. |
| mteb_massive_intent_zh_cn | HuggingFace | MTEB MASSIVE Intent Classification (Chinese (Simplified)): task-oriented dialog intent prediction. |
| mteb_massive_intent_zh_tw | HuggingFace | MTEB MASSIVE Intent Classification (Chinese (Traditional)): task-oriented dialog intent prediction. |
| mteb_massive_scenario_af | HuggingFace | MTEB MASSIVE Scenario Classification (Afrikaans): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_am | HuggingFace | MTEB MASSIVE Scenario Classification (Amharic): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_ar | HuggingFace | MTEB MASSIVE Scenario Classification (Arabic): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_az | HuggingFace | MTEB MASSIVE Scenario Classification (Azerbaijani): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_bn | HuggingFace | MTEB MASSIVE Scenario Classification (Bengali): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_cy | HuggingFace | MTEB MASSIVE Scenario Classification (Welsh): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_da | HuggingFace | MTEB MASSIVE Scenario Classification (Danish): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_de | HuggingFace | MTEB MASSIVE Scenario (German): 18-class scenario classification from voice assistant queries. |
| mteb_massive_scenario_el | HuggingFace | MTEB MASSIVE Scenario Classification (Greek): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_en | HuggingFace | MTEB MASSIVE Scenario (English): 18-class scenario classification from voice assistant queries. |
| mteb_massive_scenario_es | HuggingFace | MTEB MASSIVE Scenario (Spanish): 18-class scenario classification from voice assistant queries. |
| mteb_massive_scenario_fa | HuggingFace | MTEB MASSIVE Scenario Classification (Farsi): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_fi | HuggingFace | MTEB MASSIVE Scenario Classification (Finnish): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_fr | HuggingFace | MTEB MASSIVE Scenario (French): 18-class scenario classification from voice assistant queries. |
| mteb_massive_scenario_he | HuggingFace | MTEB MASSIVE Scenario Classification (Hebrew): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_hi | HuggingFace | MTEB MASSIVE Scenario Classification (Hindi): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_hu | HuggingFace | MTEB MASSIVE Scenario Classification (Hungarian): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_hy | HuggingFace | MTEB MASSIVE Scenario Classification (Armenian): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_id | HuggingFace | MTEB MASSIVE Scenario Classification (Indonesian): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_is | HuggingFace | MTEB MASSIVE Scenario Classification (Icelandic): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_it | HuggingFace | MTEB MASSIVE Scenario Classification (Italian): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_ja | HuggingFace | MTEB MASSIVE Scenario Classification (Japanese): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_jv | HuggingFace | MTEB MASSIVE Scenario Classification (Javanese): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_ka | HuggingFace | MTEB MASSIVE Scenario Classification (Georgian): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_km | HuggingFace | MTEB MASSIVE Scenario Classification (Khmer): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_kn | HuggingFace | MTEB MASSIVE Scenario Classification (Kannada): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_ko | HuggingFace | MTEB MASSIVE Scenario Classification (Korean): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_lv | HuggingFace | MTEB MASSIVE Scenario Classification (Latvian): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_ml | HuggingFace | MTEB MASSIVE Scenario Classification (Malayalam): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_mn | HuggingFace | MTEB MASSIVE Scenario Classification (Mongolian): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_ms | HuggingFace | MTEB MASSIVE Scenario Classification (Malay): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_my | HuggingFace | MTEB MASSIVE Scenario Classification (Burmese): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_nb | HuggingFace | MTEB MASSIVE Scenario Classification (Norwegian): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_nl | HuggingFace | MTEB MASSIVE Scenario Classification (Dutch): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_pl | HuggingFace | MTEB MASSIVE Scenario Classification (Polish): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_pt | HuggingFace | MTEB MASSIVE Scenario Classification (Portuguese): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_ro | HuggingFace | MTEB MASSIVE Scenario Classification (Romanian): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_ru | HuggingFace | MTEB MASSIVE Scenario Classification (Russian): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_sl | HuggingFace | MTEB MASSIVE Scenario Classification (Slovenian): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_sq | HuggingFace | MTEB MASSIVE Scenario Classification (Albanian): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_sv | HuggingFace | MTEB MASSIVE Scenario Classification (Swedish): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_sw | HuggingFace | MTEB MASSIVE Scenario Classification (Swahili): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_ta | HuggingFace | MTEB MASSIVE Scenario Classification (Tamil): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_te | HuggingFace | MTEB MASSIVE Scenario Classification (Telugu): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_th | HuggingFace | MTEB MASSIVE Scenario Classification (Thai): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_tl | HuggingFace | MTEB MASSIVE Scenario Classification (Tagalog): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_tr | HuggingFace | MTEB MASSIVE Scenario Classification (Turkish): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_ur | HuggingFace | MTEB MASSIVE Scenario Classification (Urdu): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_vi | HuggingFace | MTEB MASSIVE Scenario Classification (Vietnamese): task-oriented dialog scenario prediction. |
| mteb_massive_scenario_zh_cn | HuggingFace | MTEB MASSIVE Scenario (Chinese Simplified): 18-class scenario classification. |
| mteb_massive_scenario_zh_tw | HuggingFace | MTEB MASSIVE Scenario Classification (Chinese (Traditional)): task-oriented dialog scenario predicti |
| mteb_mtop_domain_de | HuggingFace | MTEB MTOP Domain Classification (German): task-oriented dialog domain prediction. |
| mteb_mtop_domain_en | HuggingFace | MTEB MTOP Domain Classification (English): task-oriented dialog domain prediction. |
| mteb_mtop_domain_es | HuggingFace | MTEB MTOP Domain Classification (Spanish): task-oriented dialog domain prediction. |
| mteb_mtop_domain_fr | HuggingFace | MTEB MTOP Domain Classification (French): task-oriented dialog domain prediction. |
| mteb_mtop_domain_hi | HuggingFace | MTEB MTOP Domain Classification (Hindi): task-oriented dialog domain prediction. |
| mteb_mtop_domain_th | HuggingFace | MTEB MTOP Domain Classification (Thai): task-oriented dialog domain prediction. |
| mteb_mtop_intent_de2 | HuggingFace | MTEB MTOP Intent Classification (German): task-oriented dialog intent prediction. |
| mteb_mtop_intent_en | HuggingFace | MTEB MTOP Intent Classification (English): task-oriented dialog intent prediction. |
| mteb_mtop_intent_es2 | HuggingFace | MTEB MTOP Intent Classification (Spanish): task-oriented dialog intent prediction. |
| mteb_mtop_intent_fr2 | HuggingFace | MTEB MTOP Intent Classification (French): task-oriented dialog intent prediction. |
| mteb_mtop_intent_hi2 | HuggingFace | MTEB MTOP Intent Classification (Hindi): task-oriented dialog intent prediction. |
| mteb_mtop_intent_th2 | HuggingFace | MTEB MTOP Intent Classification (Thai): task-oriented dialog intent prediction. |
| mteb_multilingual_sentiment | HuggingFace | MTEB Multilingual Sentiment Classification: multilingual product review sentiment. |
| mteb_naija_senti_hau | HuggingFace | MTEB NaijaSenti (Hausa): Nigerian language sentiment classification. |
| mteb_naija_senti_ibo | HuggingFace | MTEB NaijaSenti (Igbo): Nigerian language sentiment classification. |
| mteb_naija_senti_pcm | HuggingFace | MTEB NaijaSenti (Nigerian Pidgin): Nigerian language sentiment classification. |
| mteb_naija_senti_yor | HuggingFace | MTEB NaijaSenti (Yoruba): Nigerian language sentiment classification. |
| mteb_nepali_news | HuggingFace | MTEB Nepali News Classification: news category classification in Nepali. |
| mteb_nordic_lang | HuggingFace | MTEB Nordic Language Classification: language identification for Nordic languages. |
| mteb_online_shopping | HuggingFace | MTEB OnlineShopping: Chinese online shopping review sentiment classification. |
| mteb_poem_sentiment | HuggingFace | MTEB Poem Sentiment Classification: sentiment classification of poem verses. |
| mteb_sensitive_topics | HuggingFace | MTEB Sensitive Topics Classification: sensitive topic detection in text. |
| mteb_sentiment_hindi | HuggingFace | MTEB Sentiment Analysis Hindi: sentiment classification of Hindi text. |
| mteb_swiss_judgement_de | HuggingFace | MTEB Swiss Judgement Classification (German): Swiss Federal Supreme Court judgement prediction. |
| mteb_swiss_judgement_fr | HuggingFace | MTEB Swiss Judgement Classification (French): Swiss Federal Supreme Court judgement prediction. |
| mteb_swiss_judgement_it | HuggingFace | MTEB Swiss Judgement Classification (Italian): Swiss Federal Supreme Court judgement prediction. |
| mteb_tnews | HuggingFace | MTEB TNews: Chinese news topic classification dataset. |
| mteb_turkish_product | HuggingFace | MTEB Turkish Product Sentiment Classification: Turkish product review sentiment. |
| mteb_tweet_sentiment | HuggingFace | MTEB Tweet Sentiment Extraction: 3-class tweet sentiment classification. |
| mteb_tweet_topic | HuggingFace | MTEB Tweet Topic Single Classification: single-label topic classification of tweets. |
| mteb_waimai | HuggingFace | MTEB Waimai: Chinese food delivery review sentiment classification. |
| mteb_yahoo_answers | HuggingFace | MTEB Yahoo Answers Topics Classification: topic classification of Yahoo Answers questions. |
| multi_nli | HuggingFace | Multi-Genre NLI; 10 diverse genres, 3-way NLI |
| nemotron_pii | HuggingFace | Nemotron PII: document classification by domain type. 100K examples. |
| nemotron_safety | HuggingFace | Nemotron Safety Guard v3: prompt/response safety classification. 451K examples. |
| news_category | HuggingFace | News Category Dataset: 210K Huffington Post headlines with 42 categories. |
| news_channel | Download | Online News Popularity Data Set |
| nli_zh_all | HuggingFace | NLI-ZH: Chinese Natural Language Inference dataset merging multiple Chinese |
| no_robots | HuggingFace | No Robots: 10K high-quality instruction-following conversations with category labels. |
| numinamath | HuggingFace | NuminaMath: competition math problems with verified answers. 131K train examples. |
| oasst1 | HuggingFace | OpenAssistant OASST1: 161K messages from 35K conversation trees; role classification. |
| ohsumed_7400 | Kaggle | Ohsumed corpus is extracted from MEDLINE database. MEDLINE is designed for multi-label classificatio |
| ohsumed_cmu | Download | OHSUMED is a well-known medical abstracts dataset. It contains 348,566 references, |
| openbookqa | HuggingFace | OpenBookQA elementary science 4-way multiple choice. |
| or_bench | HuggingFace | OR-Bench: over-refusal benchmark prompt classification. 80K examples. |
| paws | HuggingFace | Paraphrase Adversaries from Word Scrambling; challenging paraphrase detection |
| paws_x | HuggingFace | PAWS-X multilingual paraphrase identification. 49K train examples. |
| poem_sentiment | HuggingFace | Poem Sentiment: verse-level sentiment classification from English poetry. |
| poem_sentiment_hf | HuggingFace | Poem Sentiment (Google Research): verse-level sentiment from English poetry. |
| product_sentiment_machine_hack | Download | We challenge the machinehackers community to develop a machine learning model |
| pubmed_qa | HuggingFace | PubMedQA biomedical QA. Context + question → yes/no/maybe. |
| qasc | HuggingFace | QASC: 8-way MC QA with science facts. 8134 train examples. |
| qnli | HuggingFace | Question-answering NLI; whether context sentence contains answer to question |
| qqp | HuggingFace | Quora Question Pairs; whether two questions are semantically equivalent |
| race | HuggingFace | RACE: large-scale reading comprehension from English exams for Chinese students. 87K train examples. 4-way MC. |
| reuters_cmu | Download | Reuters-21578 is a well-known newswire dataset containing 21,578 documents. |
| reuters_r8 | Kaggle | Reuters R8 subset of Reuters 21578 dataset from Kaggle. |
| reward_bench | HuggingFace | RewardBench: preference evaluation - predict which model response is chosen/preferred. |
| rotten_tomatoes | HuggingFace | Rotten Tomatoes movie review sentiment; positive/negative |
| rte | HuggingFace | Recognizing Textual Entailment; entailment vs not-entailment |
| scienceqa_vqa | HuggingFace | ScienceQA: science question + optional lecture → multiple choice answer. |
| scotus_classification | HuggingFace | LexGLUE SCOTUS; US Supreme Court opinion issue area classification |
| setfit_ag_news | HuggingFace | SetFit AG News: 4-class news topic classification (World, Sports, Business, Sci/Tech). |
| setfit_emotion | HuggingFace | SetFit Emotion: 6-class emotion classification from English Twitter messages. |
| setfit_yelp_review | HuggingFace | SetFit Yelp Review Full: 5-class star rating classification from Yelp reviews. |
| sib200 | HuggingFace | SIB-200: multilingual 7-class topic classification in 205 languages. 143K train examples. |
| sms_spam | HuggingFace | SMS spam detection: ham or spam binary classification. 5574 examples. |
| snli | HuggingFace | Stanford NLI; premise/hypothesis pairs -> entailment/neutral/contradiction |
| spotify_tracks | HuggingFace | Spotify tracks: predict genre from audio features. 114K examples. |
| sst2_hf | HuggingFace | Stanford Sentiment Treebank 2-class (HF canonical version) |
| sst3 | HuggingFace | Three-class sentiment dataset (negative/neutral/positive). |
| sst5 | HuggingFace | The SST5 dataset (Stanford Sentiment Treebank, 5-class). |
| sst5_setfit | HuggingFace | SST-5 fine-grained sentiment (SetFit version); 5 classes |
| superglue_rte | HuggingFace | SuperGLUE version of Recognizing Textual Entailment |
| synthia | HuggingFace | SYNTHIA: synthetic instructional examples across 20 academic fields. 1.2M examples. |
| tweet_eval_emoji | HuggingFace | TweetEval emoji: predict emoji from tweet text. 20-class classification. |
| tweet_sentiment_extraction | HuggingFace | Tweet sentiment extraction; positive/negative/neutral |
tweeteval_emotion |
HuggingFace | TweetEval emotion classification; 4 classes |
tweeteval_hate |
HuggingFace | TweetEval hate speech detection |
tweeteval_irony |
HuggingFace | TweetEval irony detection |
tweeteval_offensive |
HuggingFace | TweetEval offensive language detection |
tweeteval_sentiment |
HuggingFace | TweetEval sentiment; positive/negative/neutral tweet classification |
tweeteval_stance |
HuggingFace | TweetEval stance detection; against/in favor/neutral |
twitter_financial_news_topic |
HuggingFace | Twitter financial news 20-class topic classification. 17K train examples. |
wic |
HuggingFace | Word-in-Context; word sense disambiguation as binary classification |
wiki_qa |
HuggingFace | WikiQA: answer sentence selection (relevant/not-relevant). 20K train examples. |
wildchat |
HuggingFace | WildChat: 570K real ChatGPT user interactions; language + toxicity classification. |
winograd_schema |
HuggingFace | Winograd Schema Challenge; pronoun coreference resolution |
winogrande_hf |
HuggingFace | WinoGrande (allenai): large-scale commonsense reasoning benchmark (273K examples). |
wnli |
HuggingFace | Winograd NLI; pronoun reference coreference classification |
xnli |
HuggingFace | XNLI cross-lingual NLI. Premise+hypothesis -> entailment/neutral/contradiction. |
xnli_de |
HuggingFace | XNLI German: cross-lingual natural language inference, German split. |
xnli_en |
HuggingFace | XNLI English: cross-lingual natural language inference, English split. |
xnli_es |
HuggingFace | XNLI Spanish: cross-lingual natural language inference, Spanish split. |
xnli_fr |
HuggingFace | XNLI French: cross-lingual natural language inference, French split. |
xnli_zh |
HuggingFace | XNLI Chinese: cross-lingual natural language inference, Chinese split. |
yahoo_answers |
Download | The Yahoo Answers dataset |
yahoo_answers_topics |
HuggingFace | Yahoo Answers 10-class topic classification. 1.4M train examples. |
yelp_polarity |
HuggingFace | Yelp binary sentiment classification: 1 (negative) or 2 (positive). 560K train examples. |
yelp_review_full |
HuggingFace | Yelp review 5-star rating classification |
yelp_reviews |
Download | The Yelp Reviews dataset |
Text Generation / Summarization / Translation (82 datasets)
| Dataset | Source | Description |
|---|---|---|
| aeslc | HuggingFace | AESLC: annotated email subject line corpus for email body → subject summarization. |
| alpaca | Download | Stanford Alpaca instruction-tuning dataset (https://github.com/tatsu-lab/stanford_alpaca) for LLM fi |
| alpaca_cleaned | HuggingFace | Alpaca Cleaned: 52K instruction-following samples, cleaned version of Stanford Alpaca. |
| alpaca_gpt4 | HuggingFace | Alpaca GPT-4: 52K instruction-following examples. |
| alpaca_gpt4_zh | HuggingFace | Alpaca GPT-4 Chinese: 42K Chinese instruction-following examples. |
| ambig_qa | HuggingFace | AmbigQA: open-domain QA with ambiguous questions. Text-in text-out task. |
| arxiv_abstracts_2021 | HuggingFace | ArXiv abstracts 2021: predict abstract from title. 2M train examples. |
| arxiv_summarization | HuggingFace | ArXiv document summarization: predict abstract from full paper text. 203K train examples. |
| bbh | HuggingFace | Big-Bench Hard boolean expressions. Input → target. |
| big_patent | HuggingFace | BigPatent: patent claim summarization (category A). 1.2M train examples. |
| bigbench | HuggingFace | BigBench: abstract narrative understanding task. 2400 train examples. |
| billsum | HuggingFace | BillSum; US congressional and California bill summarization |
| bornholm_bitext | HuggingFace | Bornholm Bitext Mining: Danish-Bornholmsk low-resource translation. 5785 examples. |
| cmrc2018 | HuggingFace | CMRC 2018: Chinese machine reading comprehension (SQuAD-style). 10K train examples. |
| cnn_dailymail | HuggingFace | CNN/DailyMail news summarization. Article -> highlights. ~300K examples. |
| cnn_dm_hf | HuggingFace | CNN/DailyMail (abisee): news article summarization, 3.0.0 version (287K examples). |
| code_alpaca | Download | This dataset, created by sahil280114, aims to build and share an instruction-following LLaMA model f |
| code_search_net | HuggingFace | CodeSearchNet Python: function code → docstring (text generation). |
| codex_thinking | HuggingFace | CodeX 2M: code generation with chain-of-thought reasoning. 2.2M examples. |
| codexglue_code_to_text | HuggingFace | CodeXGlue Python code → docstring generation. |
| consumer_complaints | Kaggle | The dataset contains different information of complaints that customers have made about a multiple p |
| dialogsum | HuggingFace | DialogSum dialogue summarization. Dialogue -> summary. ~13K examples. |
| drop | HuggingFace | DROP: Discrete Reasoning Over Paragraphs reading comprehension. 77K train examples. |
| duorc | HuggingFace | DuoRC SelfRC: movie plot + question → answer text. |
| europarl_bg_cs | HuggingFace | Europarl: European Parliament proceedings Bulgarian-Czech translation. 400K train pairs. |
| europarl_bg_en | HuggingFace | Europarl: European Parliament proceedings Bulgarian-English translation. |
| europarl_cs_en | HuggingFace | Europarl: European Parliament proceedings Czech-English translation. |
| europarl_da_en | HuggingFace | Europarl: European Parliament proceedings Danish-English translation. |
| europarl_de_en | HuggingFace | Europarl: European Parliament proceedings German-English translation. |
| europarl_el_en | HuggingFace | Europarl: European Parliament proceedings Greek-English translation. |
| europarl_en_es | HuggingFace | Europarl: European Parliament proceedings English-Spanish translation. |
| europarl_en_fr | HuggingFace | Europarl: European Parliament proceedings English-French translation. |
| europarl_en_it | HuggingFace | Europarl: European Parliament proceedings English-Italian translation. |
| europarl_en_nl | HuggingFace | Europarl: European Parliament proceedings English-Dutch translation. |
| europarl_en_pl | HuggingFace | Europarl: European Parliament proceedings English-Polish translation. |
| europarl_en_pt | HuggingFace | Europarl: European Parliament proceedings English-Portuguese translation. |
| europarl_en_ro | HuggingFace | Europarl: European Parliament proceedings English-Romanian translation. |
| europarl_en_sv | HuggingFace | Europarl: European Parliament proceedings English-Swedish translation. |
| flashrag_2wikimultihop | HuggingFace | FlashRAG 2WikiMultiHopQA: multi-hop QA. 15K train examples. |
| gaiasky_qa | HuggingFace | Gaiasky astronomy Q&A dataset. 3.8K examples. |
| govreport_summarization | HuggingFace | GovReport: long government report summarization. 17K train examples. |
| gsm8k | HuggingFace | GSM8K; grade school math word problems with step-by-step solutions |
| gsm8k_openai | HuggingFace | GSM8K (OpenAI): grade school math word problems (8.5K problems). |
| hh_rlhf | HuggingFace | Anthropic HH-RLHF: 170K human feedback pairs for helpful/harmless RLHF training. |
| hotpot_qa | HuggingFace | HotpotQA multi-hop reasoning QA. Question → answer. |
| kilt_nq | HuggingFace | KILT NQ: Natural Questions in the KILT knowledge-intensive framework. 87K train examples. |
| language_identification | HuggingFace | Language identification; 20 languages from Twitter data |
| math500 | HuggingFace | MATH-500 test subset; competition math with step-by-step solutions |
| mathvista | HuggingFace | MathVista: math reasoning over images. Question → answer. |
| mbpp | HuggingFace | MBPP Python problem description → solution code generation. 500 problems. |
| medical_flashcards | HuggingFace | Medical flashcards: Q&A for medical topics. 34K examples. |
| multi30k | HuggingFace | Multi30k: English-German image caption translation. 29K train pairs. |
| multiun_ar_en | HuggingFace | MultiUN Arabic-English United Nations document translation. 9.8M train pairs. |
| natural_questions | HuggingFace | Natural Questions: real Google search questions with Wikipedia answers. |
| natural_questions_hard_negatives | HuggingFace | Natural Questions hard negatives for retrieval/ranking |
| naver_news_summary | HuggingFace | Naver News Korean summarization dataset |
| opus100_en_es | HuggingFace | OPUS-100 English-Spanish parallel corpus. ~1M sentence pairs. |
| opus100_en_fr | HuggingFace | OPUS-100 English-French parallel corpus. ~1M sentence pairs. |
| opus_books_en_fr | HuggingFace | OPUS Books English-French literary translations. |
| orca_dpo_pairs | HuggingFace | Intel Orca DPO Pairs: preference dataset for direct preference optimization. |
| orca_math | HuggingFace | OrcaMath: 200K math word problems with solutions. |
| phinc | HuggingFace | PHINC: Hindi-English parallel corpus. 13K train pairs. |
| pubmed_summarization | HuggingFace | PubMed biomedical document summarization. 120K train examples. |
| python_code_instructions | HuggingFace | Python code generation from instructions. 18K examples. |
| samsum | HuggingFace | SAMSum: dialogue summarization. 14K train examples. |
| sciq | HuggingFace | SciQ science QA with support text. Support + question → correct_answer (text). |
| scitail | HuggingFace | SciTail; science textual entailment from multiple-choice questions |
| setimes_bg_bs | HuggingFace | SETimes: South-East European Times Bulgarian-Bosnian translation. 136K train pairs. |
| squad | HuggingFace | SQuAD v1.1 extractive QA. Context + question → answer text. 87K examples. |
| squad_v2 | HuggingFace | SQuAD v2 extractive QA with unanswerable questions. 130K examples. |
| tofu | HuggingFace | TOFU: Fictitious Unlearning. Fictional author QA dataset for LLM unlearning research. 4K train examp |
| trivia_qa | HuggingFace | TriviaQA reading comprehension. Question → answer. |
| truthful_qa | HuggingFace | TruthfulQA: question → best truthful answer. 817 questions. |
| vukuzenzele | HuggingFace | Vukuzenzele: Afrikaans-English sentence pairs. 2.7K train examples. |
| web_questions | HuggingFace | WebQuestions: Freebase-grounded factoid QA. 3.8K train examples. |
| winogrande | HuggingFace | WinoGrande; large-scale Winograd schema challenge |
| wmt14_de_en | HuggingFace | WMT14 German-English news translation. ~4.5M sentence pairs. |
| wmt16_de_en | HuggingFace | WMT16 German-English news translation. |
| wmt19_de_en | HuggingFace | WMT19 German-English news translation. |
| wmt_t2t_de_en | HuggingFace | WMT T2T German-English news translation. 4.6M train sentence pairs. |
| xsum | HuggingFace | XSum BBC news summarization. Document -> one-sentence summary. ~200K examples. |
| xsum_hf | HuggingFace | XSum (Edinburgh NLP): extreme summarization of BBC news articles (226K examples). |
Text Regression (91 datasets)
| Dataset | Source | Description |
|---|---|---|
| adult_income_hf | HuggingFace | Adult/Census Income dataset for income >50K classification |
| ae_price_prediction | Download | Innerwear Data from Victoria's Secret and Others |
| allocine | HuggingFace | AlloCine: French movie review sentiment dataset. Binary positive/negative labels. |
| amazon_review_polarity | Download | The Amazon Reviews Polarity dataset |
| amazon_reviews_2023 | HuggingFace | Amazon Reviews 2023: predict star rating from review title and text. 571M train examples. |
| app_reviews | HuggingFace | App Reviews: 288K mobile app reviews with 1-5 star ratings. |
| beavertails | HuggingFace | BeaverTails LLM safety: prompt+response → is_safe binary + category. |
| blimp | HuggingFace | BLiMP: Benchmark of Linguistic Minimal Pairs. Binary grammaticality |
| bookprice_prediction | Download | Here we explore a database of books of different genres, from thousands of authors. |
| boolq_standalone | HuggingFace | BoolQ standalone; naturally occurring yes/no questions with passage |
| california_house_price | Download | Predict house sale prices based on the house information, such as # of bedrooms, |
| civil_comments | HuggingFace | Civil Comments toxicity classification; multi-attribute toxicity labels |
| code_defect_detection | HuggingFace | CodeXGLUE: binary defect detection in code functions. 21K train examples. |
| factcheck | HuggingFace | FactCheck: multilingual fact-checking question ranking dataset. 2M train examples. |
| fake_job_postings2 | Download | This dataset contains 18K job descriptions out of which about 800 are fake. |
| fineweb_edu | HuggingFace | FineWeb-Edu: 1.3T token high-quality educational web text; 10BT sample subset. |
| germeval18 | HuggingFace | GermEval 2018: German offensive language detection. Binary and multi-class labels. |
| google_qa_answer_type_reason_explanation | Download | Google QUEST Q&A Labeling |
| google_qa_question_type_reason_explanation | Download | Google QUEST Q&A Labeling |
| hc3 | HuggingFace | Human ChatGPT Comparison Corpus (HC3): binary classification of whether an |
| hc3_chinese | HuggingFace | HC3-Chinese: Chinese Human ChatGPT Comparison Corpus. Binary classification |
| helpsteer2 | HuggingFace | HelpSteer2: 21K prompt-response pairs with 5 quality attributes rated 0-4. |
| imdb | Kaggle | IMDB dataset having 50K movie reviews for natural language processing or Text analytics. |
| imdb_genre_prediction | Download | A data set of 1,000 most popular movies on IMDB in the last 10 years. The data points included are: |
| imdb_mteb | HuggingFace | IMDB movie review sentiment binary classification (positive/negative). 24K train examples. |
| irony | Download | The Reddit Irony dataset. |
| jc_penney_products | Download | JCPenney products |
| jigsaw_unintended_bias | Download | A dataset labeled for identity mentions and optimizing a metric designed to measure unintended bias. |
| jigsaw_unintended_bias100k | Download | A dataset labeled for identity mentions and optimizing a metric designed to measure unintended bias. |
| kick_starter_funding | Download | Funding Successful Projects on Kickstarter |
| klue_sts | HuggingFace | KLUE STS Korean sentence similarity. Sentence pair → score (0-5 regression). |
| lmsys_arena | HuggingFace | LMSYS Arena: 55K human preference comparisons between LLM responses. |
| measuring_hate_speech | HuggingFace | Measuring Hate Speech; continuous hate speech score regression |
| mercari_price_suggestion | Download | Predict product price based on details like product category name, brand name, and item condition. |
| mercari_price_suggestion100K | Download | Predict product price based on details like product category name, brand name, and item condition. |
| moral_stories | HuggingFace | Moral Stories: binary classification of moral vs immoral actions given a norm, |
| msmarco_passage | HuggingFace | MS MARCO passage retrieval; query-passage relevance scoring |
| mteb_amazon_polarity | HuggingFace | MTEB Amazon Polarity: binary positive/negative sentiment from Amazon reviews (4M reviews). |
| mteb_biosses | HuggingFace | MTEB BIOSSES: biomedical sentence similarity benchmark (100 sentence pairs). |
| mteb_imdb | HuggingFace | MTEB IMDB: binary movie review sentiment classification. |
| mteb_sts17_ar | HuggingFace | MTEB STS17 (Arabic-Arabic): Arabic semantic textual similarity. |
| mteb_sts17_de | HuggingFace | MTEB STS17 (German-English): cross-lingual semantic textual similarity. |
| mteb_sts17_en | HuggingFace | MTEB STS17 (English-English): semantic textual similarity regression. |
| mteb_sts17_es | HuggingFace | MTEB STS17 (Spanish-English): cross-lingual semantic textual similarity. |
| mteb_sts17_fr | HuggingFace | MTEB STS17 (French-English): cross-lingual semantic textual similarity. |
| mteb_sts22_ar | HuggingFace | MTEB STS22 (Arabic): semantic textual similarity regression. |
| mteb_sts22_de | HuggingFace | MTEB STS22 (German): semantic textual similarity regression. |
| mteb_sts22_de_en | HuggingFace | MTEB STS22 (German-English cross-lingual): semantic textual similarity. |
| mteb_sts22_de_fr | HuggingFace | MTEB STS22 (German-French cross-lingual): semantic textual similarity. |
| mteb_sts22_en | HuggingFace | MTEB STS22 (English): semantic textual similarity regression. |
| mteb_sts22_es | HuggingFace | MTEB STS22 (Spanish): semantic textual similarity regression. |
| mteb_sts22_es_en | HuggingFace | MTEB STS22 (Spanish-English cross-lingual): semantic textual similarity. |
| mteb_sts22_es_it | HuggingFace | MTEB STS22 (Spanish-Italian cross-lingual): semantic textual similarity. |
| mteb_sts22_fr | HuggingFace | MTEB STS22 (French): semantic textual similarity regression. |
| mteb_sts22_it | HuggingFace | MTEB STS22 (Italian): semantic textual similarity regression. |
| mteb_sts22_pl | HuggingFace | MTEB STS22 (Polish): semantic textual similarity regression. |
| mteb_sts22_pl_en | HuggingFace | MTEB STS22 (Polish-English cross-lingual): semantic textual similarity. |
| mteb_sts22_ru | HuggingFace | MTEB STS22 (Russian): semantic textual similarity regression. |
| mteb_sts22_tr | HuggingFace | MTEB STS22 (Turkish): semantic textual similarity regression. |
| mteb_sts22_zh | HuggingFace | MTEB STS22 (Chinese): semantic textual similarity regression. |
| mteb_sts22_zh_en | HuggingFace | MTEB STS22 (Chinese-English cross-lingual): semantic textual similarity. |
| mteb_stsbenchmark | HuggingFace | MTEB STSBenchmark: STS Benchmark semantic similarity (8K sentence pairs, scores 0-5). |
| mteb_toxic_convo | HuggingFace | MTEB Toxic Conversations 50K: binary toxicity classification. |
| multirc | HuggingFace | SuperGLUE MultiRC: paragraph + question + answer → binary (answer correct?). |
| news_popularity2 | Download | Online News Popularity Data Set |
| persuasion | HuggingFace | Anthropic Persuasion: rate how persuasive arguments are for various claims. |
| sarcastic_headlines | Kaggle | A dataset to determine if a news headline is sarcastic or serious. |
| scandisent | HuggingFace | ScandiSent: Nordic language sentiment classification. 50K train examples. |
| setfit_amazon_polarity | HuggingFace | SetFit Amazon Polarity: binary positive/negative sentiment from Amazon reviews. |
| setfit_mrpc | HuggingFace | SetFit MRPC: Microsoft Research Paraphrase Corpus, binary paraphrase detection. |
| setfit_sst2 | HuggingFace | SetFit SST-2: Stanford Sentiment Treebank binary sentiment (SetFit format). |
| setfit_subj | HuggingFace | SetFit SUBJ: subjectivity detection (subjective vs objective sentences). |
| sickr | HuggingFace | SICK-R: sentences involving compositional knowledge. 9927 test examples. |
| sst2 | HuggingFace | The SST2 dataset (Stanford Sentiment Treebank, binary). |
| stackoverflow_posts | HuggingFace | Stack Overflow posts: predict question score from title and body. 58M train examples. |
| sts12 | HuggingFace | STS 2012: semantic textual similarity benchmark. 2234 train examples. |
| sts13 | HuggingFace | STS 2013: semantic textual similarity. 1500 test examples. |
| sts14 | HuggingFace | STS 2014: semantic textual similarity. 3750 test examples. |
| sts15 | HuggingFace | STS 2015: semantic textual similarity. 3000 test examples. |
| sts16 | HuggingFace | STS 2016: semantic textual similarity. 1186 test examples. |
| sts17 | HuggingFace | STS 2017: cross-lingual STS. 5346 test examples. |
| sts22 | HuggingFace | STS 2022: cross-lingual semantic textual similarity. 4.6K train examples. |
| sts_benchmark | HuggingFace | STS Benchmark; sentence pair semantic similarity scoring |
| stsb | HuggingFace | Semantic Textual Similarity Benchmark; similarity score 0-5 |
| stsb_de | HuggingFace | STS Benchmark German (machine translated). 5749 train examples. |
| stsb_sentencetransformers | HuggingFace | STS Benchmark: semantic textual similarity scoring (0-5 scale). 5.7K train examples. |
| student_performance | HuggingFace | Student performance regression; predict final grade from demographics |
| toxic_chat | HuggingFace | LMSys ToxicChat: human toxicity annotation for LLM conversations. |
| wine_reviews | Download | Wine Reviews |
| women_clothing_review | Download | Women's E-Commerce Clothing Reviews |
| yelp_review_polarity | Download | The Yelp Polarity dataset |
Sequence Tagging / NER (12 datasets)
| Dataset | Source | Description |
|---|---|---|
| acronym_identification | HuggingFace | Acronym identification: tokens → B-long/I-long/B-short/I-short/O tags. |
| audioset_balanced | HuggingFace | AudioSet balanced: audio event classification with 527 sound classes from YouTube. 18K train example |
| few_nerd | HuggingFace | Few-NERD fine-grained NER with 8 coarse and 66 fine entity types. |
| multinerd | HuggingFace | MultiNERD multilingual NER. 31 fine-grained entity types. CC BY 4.0. |
| nq_open | HuggingFace | Natural Questions Open: open-domain QA with Wikipedia answers. 88K train examples. |
| pii_masking | HuggingFace | PII masking: source text → BIO labels for PII tokens. |
| universal_dependencies | HuggingFace | Universal Dependencies English EWT: POS tagging with UPOS tags. 12K train sentences. |
| wikiann | HuggingFace | WikiANN (PAN-X) Named Entity Recognition — English |
| wikiann_de | HuggingFace | WikiANN German NER. IOB2 tags: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC. |
| wikiann_en | HuggingFace | WikiANN English NER. IOB2 tags: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC. |
| wikiann_zh | HuggingFace | WikiANN Chinese NER. IOB2 tags: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC. |
| winobias | HuggingFace | WinoBias coreference gender bias. Sentence → label. |
Multi-label Classification (4 datasets)
| Dataset | Source | Description |
|---|---|---|
| go_emotions | HuggingFace | GoEmotions: Multi-label Emotion Classification |
| go_emotions_multiclass | HuggingFace | GoEmotions multi-label emotion (28 classes). Set output. Same as go_emotions but listed separately. |
| lex_glue_ecthr | HuggingFace | LexGLUE ECtHR case text → violated ECHR articles (multi-label). Set output. |
| lex_glue_eurlex | HuggingFace | LexGLUE EuroLex EU documents → EuroVoc concept labels (multi-label). Set output. |
Image Classification (29 datasets)
| Dataset | Source | Description |
|---|---|---|
| ai_generated_ecommerce | HuggingFace | AI-generated vs real e-commerce product images. 6K examples. |
| beans | HuggingFace | Beans leaf disease classification; 3 classes: angular leaf spot, bean rust, healthy |
| cifar10 | HuggingFace | CIFAR-10; 10-class 32x32 image classification |
| cifar100 | HuggingFace | CIFAR-100; 100-class 32x32 image classification |
| eurosat | HuggingFace | EuroSAT; land use/cover classification from satellite imagery; 10 classes |
| eurosat_rgb | HuggingFace | EuroSAT RGB: Sentinel-2 satellite image classification (10 land use classes). 16K train examples. |
| fashion_mnist | HuggingFace | Fashion MNIST; 10-class clothing item classification |
| food101 | HuggingFace | Food-101; 101-class food image classification |
| graid_bdd | HuggingFace | GrAID-BDD: grounded autonomous driving image QA. 4.6M train examples. |
| gtsrb | HuggingFace | GTSRB; German Traffic Sign Recognition Benchmark; 43 classes |
| handwritten_crossouts | HuggingFace | Handwritten cross-outs: classify handwriting correction styles. 22K examples. |
| imagenet_100 | HuggingFace | ImageNet-100: 100-class subset of ImageNet. 126K train examples. |
| intuitive_physics | HuggingFace | Intuitive Physics: image-based physical intuition classification. 280K train. |
| kvasir_vqa | HuggingFace | Kvasir-VQA: gastrointestinal endoscopy visual QA. 143K train examples. |
| map_trace | HuggingFace | MapTrace: map-type image classification (7 categories). 20K examples. |
| mini_imagenet | HuggingFace | Mini-ImageNet: 100-class image classification subset of ImageNet. 50K train examples. |
| mnist | Download | The MNIST database of handwritten digits, available from this page, |
| mnist_ylecun | HuggingFace | MNIST handwritten digit recognition. 60K train examples. |
| newyorker_caption_contest | HuggingFace | New Yorker Caption Contest — Multimodal Image+Text Classification |
| openfake | HuggingFace | OpenFake: AI-generated vs real image binary classification. |
| oxford_pets | HuggingFace | Oxford-IIIT Pet; 37 pet breed classification |
| path_vqa | HuggingFace | PathVQA: pathology visual QA (yes/no + open). 20K train examples. |
| rendered_sst2 | HuggingFace | Rendered SST2; sentiment classification from rendered text images |
| stanford_cars | HuggingFace | Stanford Cars; 196-class fine-grained car make/model/year classification |
| sun397 | HuggingFace | SUN397; scene understanding; 397 scene categories |
| svhn | HuggingFace | SVHN; Street View House Numbers digit classification |
| tiny_imagenet | HuggingFace | Tiny ImageNet; 200-class 64x64 subset of ImageNet |
| tobacco_document | HuggingFace | Tobacco document image classification with OCR text. 2.2K examples. |
| wikiart | HuggingFace | WikiArt; artwork style classification across 27 art styles |
Document Understanding / VQA (8 datasets)
| Dataset | Source | Description |
|---|---|---|
| ai2d_diagrams | HuggingFace | AI2 Diagrams VQA: diagram image + question → answer. |
| cord_v2 | HuggingFace | CORD v2 receipt understanding: receipt image -> JSON ground truth (text generation). |
| docvqa | HuggingFace | DocVQA: document image + question → answer from the document. |
| invoice_data | HuggingFace | Invoice document understanding: invoice image -> JSON ground truth (text generation). |
| merit | HuggingFace | MERIT: historical document image recognition. 7K train examples. |
| textvqa | HuggingFace | TextVQA: image with text + question → answer requiring reading the text. |
| vqa_rad | HuggingFace | VQA-RAD; radiology visual question answering |
| vqav2 | HuggingFace | VQAv2: image + question → free-form answer. |
Semantic Segmentation (1 dataset)
| Dataset | Source | Description |
|---|---|---|
| camseq | Kaggle | CamSeq01 Cambridge Labeled Objects in Video |
Audio Classification (8 datasets)
| Dataset | Source | Description |
|---|---|---|
| abjad_kids | HuggingFace | Abjad-Kids: Arabic letter audio classification (28 classes). 40K examples. |
| emodb | HuggingFace | EMo-DB; Berlin Database of Emotional Speech; 7 emotion classes |
| esc50 | Download | ESC-50: Environmental Sound Classification |
| minds14 | HuggingFace | MINDS-14: multilingual banking intent classification from audio. 8K train examples. |
| mmsulab | HuggingFace | MMSuLab: multilingual audio speech laboratory data. 1.9M train examples. |
| naturelm_audio | HuggingFace | NatureLM Audio: wildlife/nature audio instruction following. 26M train examples. |
| speech_massive | HuggingFace | Speech-MASSIVE: multilingual audio intent classification. 23K train examples. |
| tadabur | HuggingFace | Tadabur: Arabic Quran audio with reciter, surah, and ayah classification. 409K examples. |
Speech Recognition (ASR) (7 datasets)
| Dataset | Source | Description |
|---|---|---|
| ami_asr | HuggingFace | AMI corpus: meeting audio transcription. 108K train examples. |
| cantonese_asr | HuggingFace | Cantonese speech recognition dataset |
| librispeech | HuggingFace | LibriSpeech; English speech recognition from audiobooks; clean 100h split |
| mls_german | HuggingFace | Multilingual LibriSpeech German ASR |
| peoples_speech | HuggingFace | People's Speech; 30K-hour diverse English ASR |
| ravnursson_asr | HuggingFace | Ravnursson: Faroese audio speech recognition dataset. 65K train examples. |
| voxpopuli | HuggingFace | VoxPopuli; European Parliament speech ASR; 23 languages |
Tabular Classification (12 datasets)
| Dataset | Source | Description |
|---|---|---|
| bbcnews | Download | BBC News Classification from Kaggle. |
| connect4 | Kaggle | Each row represents the end results of a Connect-4 game. |
| diabetes_readmission | HuggingFace | Diabetes hospital readmission prediction |
| forest_cover | Download | The Forest Cover Type dataset. |
| hermes_function_calling | HuggingFace | Hermes function calling: tool-use conversations with category labels. 1893 train examples. |
| iris | Download | Iris Dataset |
| iris_sklearn | HuggingFace | Iris dataset: classic 3-class flower classification by petal/sepal measurements. |
| mushroom_edibility | Download | This data set includes descriptions of hypothetical samples corresponding |
| otto_group_product | Download | The Otto Group Product Classification Challenge |
| poker_hand | Download | Each record is an example of a hand consisting of five playing cards |
| taix_ray | HuggingFace | TAIX-Ray: thoracic X-ray study - predict patient sex from age and other metadata. 137K train example |
| walmart_recruiting | Download | Walmart Recruiting: Trip Type Classification |
Tabular Binary Classification (24 datasets)
| Dataset | Source | Description |
|---|---|---|
| adult_census_income | Download | Predict whether income exceeds $50K/yr based on census data |
| amazon_employee_access_challenge | Download | There is a considerable amount of data regarding an employee’s role within an organization and the r |
| bnp_claims_management | Download | The BNP Paribas Cardif Claims Management dataset. |
| compas_recidivism | HuggingFace | COMPAS recidivism risk prediction; criminal justice fairness benchmark |
| credit_card_default | HuggingFace | Credit card default prediction from payment history |
| creditcard_fraud | Kaggle | The Machine Learning Group ULB Dataset |
| customer_churn_prediction | Download | Dataset from a Kaggle competition that is about predicting whether a customer will change |
| electricity_tabular | HuggingFace | Electricity market price direction classification |
| heart_failure | HuggingFace | Heart failure clinical records; death event prediction |
| higgs | Download | The Higgs Boson dataset. |
| ieee_fraud | Download | The IEEE-CIS Fraud Detection Dataset |
| imbalanced_insurance | Kaggle | Health Insurance Cross Sell Prediction |
| kdd_appetency | Download | The KDD Cup 2009 Appetency dataset. |
| kdd_churn | Download | The KDD Cup 2009 Churn dataset. |
| kdd_upselling | Download | The KDD Cup 2009 Upselling dataset. |
| noshow_appointments | Kaggle | 110,527 medical appointments with 14 associated variables (characteristics). |
numerai28pt6 |
Kaggle | Encrypted Stock Market Data from Numerai dataset from Kaggle. |
porto_seguro_safe_driver |
Download | Predict the probability that an auto insurance policy holder files a claim. |
santander_customer_satisfaction |
Download | Santander Customer Satisfaction Prediction. |
santander_customer_transaction |
Download | Santander Customer Transaction Prediction. |
synthetic_fraud |
Kaggle | The Synthetic Financial Datasets For Fraud Detection dataset. |
talkingdata_adtrack_fraud |
Download | TalkingData AdTracking Fraud Detection Challenge. |
telco_customer_churn |
Kaggle | The Telco customer churn data contains information about a fictional telco company |
titanic |
Download | The Titanic dataset: use machine learning to create a model |
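
The Source column shows where each dataset is fetched from. Entries marked Kaggle need Kaggle credentials, as described in the Kaggle Datasets section. As a minimal sketch, the helpers below build a `ludwig://<dataset>` URI and check the `.is_kaggle_dataset` property before loading. The names `dataset_uri` and `load_splits`, and the `get_dataset` parameter, are illustrative assumptions and not part of Ludwig; only `ludwig.datasets.get_dataset`, `.is_kaggle_dataset`, and `.load(split=True)` come from the API described above.

```python
def dataset_uri(name):
    """Build the ludwig://<dataset> URI form accepted by `ludwig train --dataset`."""
    return f"ludwig://{name}"


def load_splits(name, get_dataset=None):
    """Load a zoo dataset into split dataframes, flagging Kaggle-hosted datasets.

    `get_dataset` defaults to ludwig.datasets.get_dataset. It is a parameter
    only so the lookup can be swapped out (a hypothetical convenience).
    """
    if get_dataset is None:
        # Imported lazily, so this module can be imported without Ludwig installed.
        import ludwig.datasets
        get_dataset = ludwig.datasets.get_dataset

    dataset = get_dataset(name)
    if dataset.is_kaggle_dataset:
        print(f"{name} is hosted on Kaggle; Kaggle credentials must be configured.")
    # Same call as in the examples above; returns the split dataframes.
    return dataset.load(split=True)
```

With Ludwig installed, `train_df, test_df, _ = load_splits("titanic")` mirrors the earlier example, and `dataset_uri("titanic")` gives the string to pass to `ludwig train --dataset`.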
Tabular Regression (18 datasets)
| Dataset | Source | Description |
|---|---|---|
| allstate_claims_severity | Download | Allstate Claims Severity. |
| ames_housing | Download | The Ames Housing dataset. |
| california_housing | Download | California Housing dataset from the 1990 US Census. Predict median house value |
| electricity | Download | Electricity demand dataset. Half-hourly electricity demand in Victoria, Australia during 2014, along |
| gaia_cepheids | HuggingFace | Gaia DR3 Cepheids: predict stellar period from photometry. 15K examples. |
| gaia_rrlyrae | HuggingFace | Gaia DR3 RR Lyrae: predict fundamental period. 271K examples. |
| gaia_spectroscopic_binaries | HuggingFace | Gaia DR3 Spectroscopic Binaries: predict orbital period. 186K examples. |
| gaia_young_stellar_objects | HuggingFace | Gaia DR3 Young Stellar Objects: class prediction from photometry. 79K examples. |
| mercedes_benz_greener | Download | The Mercedes-Benz Greener Manufacturing dataset. |
| naval | Download | Condition Based Maintenance of Naval Propulsion Plants Data Set |
| protein | Download | Physicochemical Properties of Protein Tertiary Structure Data Set. |
| repid | HuggingFace | REPID: Retinal Perceptual Image Dataset (tabular perceptual quality metrics) |
| rossman_store_sales | Download | The Rossmann Store Sales dataset. |
| santander_value_prediction | Download | The Santander Value Prediction Challenge dataset. |
| sarcos | Download | The data relates to an inverse dynamics problem for a seven |
| stocks_daily_price | HuggingFace | Daily stock OHLCV: predict close price from symbol and features. 25M examples. |
| temperature | Kaggle | Hourly temperature dataset from Kaggle |
| yosemite | Download | Yosemite temperatures dataset. |
Multimodal (7 datasets)
| Dataset | Source | Description |
|---|---|---|
| flickr8k | Download | A new benchmark collection for sentence-based image description and search, |
| goodbooks_books | Download | goodbooks_books is a multimodal dataset of 10K books, taken from the goodreads dataset. |
| insurance_lite | Kaggle | The dataset consists of parameters such as the images of damaged cars, |
| mobile_mold | HuggingFace | MobileMold: smartphone food mold binary image classification. 3.5K examples. |
| twitter_bots | Kaggle | A dataset for Twitter Bot account detection. |
| wmt15 | Kaggle | French/English parallel texts for training translation models. |
| world_speech_asr | HuggingFace | WorldSpeech: multilingual audio quality with CER measurement. 1.2M train examples. |
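
Several of these datasets reference media files such as images or audio, which is the case where `.export(output_directory)` is recommended. A minimal sketch of exporting one of them might look like the following. The helper name `export_dataset` and its `get_dataset` parameter are assumptions made for illustration; only `ludwig.datasets.get_dataset` and `.export(output_directory)` come from the API described above.

```python
def export_dataset(name, output_directory, get_dataset=None):
    """Export a processed zoo dataset, with the files it depends on, to a directory.

    `get_dataset` defaults to ludwig.datasets.get_dataset. It is a parameter
    only so the lookup can be swapped out (a hypothetical convenience).
    """
    if get_dataset is None:
        # Imported lazily, so this module can be imported without Ludwig installed.
        import ludwig.datasets
        get_dataset = ludwig.datasets.get_dataset

    dataset = get_dataset(name)
    # Writes the processed dataset and any dependent media files.
    dataset.export(output_directory)
    return output_directory
```

For example, `export_dataset("flickr8k", "./flickr8k_export")` would export into a directory from which the training process can be run, keeping in mind that file paths are relative to its working directory.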
Special / Utility (1 dataset)
| Dataset | Source | Description |
|---|---|---|
| hugging_face | Download | Hugging Face Datasets |
Adding datasets
To add a dataset to the Ludwig Dataset Zoo, see Add a Dataset.