Dataset Zoo
The Ludwig Dataset Zoo provides datasets that can be directly plugged into a Ludwig model.
The simplest way to use a dataset is to reference it as a URI when specifying the training set:
ludwig train --dataset ludwig://reuters ...
Any Ludwig dataset can be specified as a URI of the form ludwig://<dataset>.
Datasets can also be programmatically imported and loaded into a Pandas DataFrame using the .load() method:
from ludwig.datasets import reuters
# Loads into single dataframe with a 'split' column:
dataset_df = reuters.load()
# Loads into split dataframes:
train_df, test_df, _ = reuters.load(split=True)
The ludwig.datasets API also provides functions to list, describe, and get datasets. For example:
import ludwig.datasets
# Gets a list of all available dataset names.
dataset_names = ludwig.datasets.list_datasets()
# Prints the description of the titanic dataset.
print(ludwig.datasets.describe_dataset("titanic"))
# Gets the titanic dataset.
titanic = ludwig.datasets.get_dataset("titanic")
# Loads into single dataframe with a 'split' column:
dataset_df = titanic.load()
# Loads into split dataframes:
train_df, test_df, _ = titanic.load(split=True)
Kaggle Datasets
Some datasets are hosted on Kaggle and require a Kaggle account. To use these, you'll need to set up Kaggle credentials in your environment. If the dataset is part of a Kaggle competition, you'll also need to accept the terms on the competition page.
To check programmatically whether a dataset is hosted on Kaggle, use its .is_kaggle_dataset property.
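For example, credentials can be supplied through the environment variables read by the official Kaggle API client (a minimal sketch; the username and key below are placeholders):
import os
# Placeholder credentials; replace with your own Kaggle username and API key.
os.environ["KAGGLE_USERNAME"] = "my_username"
os.environ["KAGGLE_KEY"] = "my_api_key"
import ludwig.datasets
# titanic is a Kaggle competition dataset, so this should print True.
titanic = ludwig.datasets.get_dataset("titanic")
print(titanic.is_kaggle_dataset)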
Downloading, Processing, and Exporting
Datasets are first downloaded into LUDWIG_CACHE, which may be set as an environment variable and defaults to $HOME/.ludwig_cache.
Datasets are automatically loaded, processed, and re-saved as parquet files in the cache.
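For example, the cache location can be redirected before any dataset is loaded (a sketch; the directory path is a placeholder, and this assumes Ludwig reads the variable at load time):
import os
# Hypothetical cache directory; set before loading any dataset.
os.environ["LUDWIG_CACHE"] = "/data/ludwig_cache"
from ludwig.datasets import reuters
dataset_df = reuters.load()  # downloads and processes into the custom cache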
To export the processed dataset, including any files it depends on, use the .export(output_directory) method. This is recommended if the dataset contains media files like images or audio files. File paths are relative to the working directory of the training process.
from ludwig.datasets import twitter_bots
# Exports twitter bots dataset and image files to the current working directory.
twitter_bots.export(".")
End-to-end Example
Here's an end-to-end example of training a model using the MNIST dataset:
from ludwig.api import LudwigModel
from ludwig.datasets import mnist
# Initializes a Ludwig model
config = {
"input_features": [{"name": "image_path", "type": "image"}],
"output_features": [{"name": "label", "type": "category"}],
}
model = LudwigModel(config)
# Loads and splits MNIST dataset
training_set, test_set, _ = mnist.load(split=True)
# Exports the mnist image files to the current working directory.
mnist.export(".")
# Runs model training
train_stats, _, _ = model.train(training_set=training_set, test_set=test_set, model_name="mnist_model")
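After training, the held-out test split can be scored with the same API (a sketch; this assumes the current LudwigModel.evaluate signature, which returns evaluation statistics, predictions, and the output directory):
# Evaluates the trained model on the test split.
eval_stats, predictions, output_dir = model.evaluate(test_set)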
Dataset Splits
All datasets in the dataset zoo are provided with a default train/validation/test split. When loading with split=False, the default split will be returned (and is guaranteed to be the same every time). With split=True, Ludwig will randomly re-split the dataset.
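For instance, with the reuters dataset from above (assuming, consistent with the earlier examples, that the third return value is the validation split):
from ludwig.datasets import reuters
# split=False (the default): one dataframe whose 'split' column encodes the
# deterministic default split.
dataset_df = reuters.load(split=False)
# split=True: three dataframes from a random re-split.
train_df, test_df, val_df = reuters.load(split=True)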
Note
Some benchmark or contest datasets are released with the test set labels held out. In other words, the train and validation splits have labels, but the test set does not. Most Kaggle contest datasets have this unlabeled test set.
Splits:
- train: Data to train on. Required, must have labels.
- validation: Subset of dataset to evaluate while training. Optional, must have labels.
- test: Held out from model development, used for later testing. Optional, may not be labeled.
Zoo Datasets
Here is the list of currently available datasets:
Adding datasets
To add a dataset to the Ludwig Dataset Zoo, see Add a Dataset.