
Dataset Zoo

The Ludwig Dataset Zoo provides datasets that can be directly plugged into a Ludwig model.

The simplest way to use a dataset is to import and load it. The .load() method returns a pandas DataFrame.

from ludwig.datasets import reuters

# Loads into single dataframe with a 'split' column:
dataset_df = reuters.load()

# Loads into split dataframes:
train_df, test_df, _ = reuters.load(split=True)

The ludwig.datasets API also provides functions to list, describe, and get datasets. For example:

import ludwig.datasets

# Gets a list of all available dataset names.
dataset_names = ludwig.datasets.list_datasets()

# Prints the description of the titanic dataset.
print(ludwig.datasets.describe_dataset("titanic"))

# Gets the titanic dataset by name.
titanic = ludwig.datasets.get_dataset("titanic")

# Loads into single dataframe with a 'split' column:
dataset_df = titanic.load()

# Loads into split dataframes:
train_df, test_df, _ = titanic.load(split=True)

Kaggle Datasets

Some datasets are hosted on Kaggle and require a Kaggle account. To use these, you'll need to set up Kaggle credentials in your environment. If the dataset is part of a Kaggle competition, you'll also need to accept the terms on the competition page.

To check programmatically whether a dataset is hosted on Kaggle, use its .is_kaggle_dataset property.
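
The Kaggle API client conventionally reads credentials from either the KAGGLE_USERNAME and KAGGLE_KEY environment variables or a kaggle.json token file under ~/.kaggle. As a minimal sketch, the helper below checks for either before you try to load a Kaggle-hosted dataset. The function name is hypothetical and is not part of Ludwig.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper, not part of Ludwig. It only checks the two
    conventional credential locations; it does not validate the token.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: both environment variables are set and non-empty.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: an API token file exists at <home>/.kaggle/kaggle.json.
    return (home / ".kaggle" / "kaggle.json").is_file()


if __name__ == "__main__":
    print("Kaggle credentials found:", kaggle_credentials_available())
```

A check like this can be combined with the .is_kaggle_dataset property to fail early with a clear message instead of partway through a download.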

Downloading, Processing, and Exporting

Datasets are first downloaded into LUDWIG_CACHE, which may be set as an environment variable and defaults to $HOME/.ludwig_cache.

Datasets are automatically loaded, processed, and re-saved as Parquet files in the cache.
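
The cache location rule above can be sketched in a few lines. This is an illustration of the documented behavior (environment variable first, then $HOME/.ludwig_cache), not Ludwig's own implementation; the function name is hypothetical.

```python
import os
from pathlib import Path


def resolve_ludwig_cache(env=None):
    """Return the cache directory implied by the documented rule.

    Hypothetical helper, not part of Ludwig: uses LUDWIG_CACHE when it is
    set, otherwise falls back to <home>/.ludwig_cache.
    """
    env = os.environ if env is None else env

    # An explicit LUDWIG_CACHE setting takes precedence.
    override = env.get("LUDWIG_CACHE")
    if override:
        return Path(override)

    # Default: a .ludwig_cache directory under the user's home.
    home = env.get("HOME") or str(Path.home())
    return Path(home) / ".ludwig_cache"


if __name__ == "__main__":
    print("Datasets would be cached in:", resolve_ludwig_cache())
```

For example, exporting LUDWIG_CACHE=/data/ludwig_cache in the shell before launching Python would direct downloads to that directory, which is useful when the home directory has limited space.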

To export the processed dataset, including any files it depends on, use the .export(output_directory) method. This is recommended if the dataset contains media files like images or audio files. File paths are relative to the working directory of the training process.

from ludwig.datasets import twitter_bots

# Exports twitter bots dataset and image files to the current working directory.
twitter_bots.export(".")
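
Because file paths in an exported dataset are relative, they resolve against whatever directory the training process runs in. The sketch below uses only the standard library to show how a relative path is resolved and how missing files could be detected before training. The helper names and the example path are made up for illustration.

```python
from pathlib import Path


def resolve_media_path(relative_path, working_dir=None):
    """Resolve a relative media path against the training working directory."""
    working_dir = Path.cwd() if working_dir is None else Path(working_dir)
    return working_dir / relative_path


def missing_media(relative_paths, working_dir=None):
    """Return the relative paths that do not point at an existing file."""
    return [
        p for p in relative_paths
        if not resolve_media_path(p, working_dir).is_file()
    ]


if __name__ == "__main__":
    # Hypothetical path, as it might appear in an image path column.
    print(resolve_media_path("images/0001.png"))
    print(missing_media(["images/0001.png"]))
```

If a check like this reports missing files, the usual cause is that training was started from a different directory than the one the dataset was exported to.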

End-to-end Example

Here's an end-to-end example of training a model using the MNIST dataset:

from ludwig.api import LudwigModel
from ludwig.datasets import mnist

# Initializes a Ludwig model
config = {
    "input_features": [{"name": "image_path", "type": "image"}],
    "output_features": [{"name": "label", "type": "category"}],
}
model = LudwigModel(config)

# Loads and splits MNIST dataset
training_set, test_set, _ = mnist.load(split=True)

# Exports the MNIST image files to the current working directory.
mnist.export(".")

# Runs model training
train_stats, _, _ = model.train(training_set=training_set, test_set=test_set, model_name="mnist_model")

Dataset Splits

All datasets in the dataset zoo are provided with a default train/validation/test split. When loading with split=False, the default split is returned in the dataframe's 'split' column and is guaranteed to be the same every time. With split=True, Ludwig will randomly re-split the dataset.


Some benchmark or contest datasets are released with held-out test set labels. In other words, the train and validation splits have labels, but the test set does not. Most Kaggle contest datasets have this unlabeled test set.

The three splits are:

  • train: Data to train on. Required, must have labels.
  • validation: Subset of dataset to evaluate while training. Optional, must have labels.
  • test: Held out from model development, used for later testing. Optional, may not be labeled.
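
To make the split semantics concrete, here is a minimal sketch in plain Python that partitions rows by a 'split' column and filters out unlabeled test rows before evaluation. The 0/1/2 encoding for train/validation/test, the column names, and the helper names are assumptions for illustration; inspect the loaded dataframe for the actual values.

```python
from collections import defaultdict

# Assumed encoding of the 'split' column (an assumption; verify on real data).
SPLIT_NAMES = {0: "train", 1: "validation", 2: "test"}


def partition_by_split(rows):
    """Group row dicts into train/validation/test lists by their 'split' value."""
    parts = defaultdict(list)
    for row in rows:
        parts[SPLIT_NAMES[row["split"]]].append(row)
    # Always return all three keys, even when a split is empty.
    return {name: parts[name] for name in SPLIT_NAMES.values()}


def labeled_only(rows, label_column):
    """Keep only rows that have a label, e.g. before computing test metrics."""
    return [row for row in rows if row.get(label_column) is not None]


# Made-up example rows; the last one mimics a held-out test row with no label.
rows = [
    {"text": "a", "label": "pos", "split": 0},
    {"text": "b", "label": "neg", "split": 0},
    {"text": "c", "label": "pos", "split": 1},
    {"text": "d", "label": None, "split": 2},
]
parts = partition_by_split(rows)
```

With a pandas DataFrame, the same row dictionaries can be obtained with df.to_dict("records"). When labeled_only returns an empty list for the test split, the test set can be used for generating predictions but not for computing metrics.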

Zoo Datasets

Here is the list of the currently available datasets:

Dataset | Hosted On | Description
adult_census_income | | Whether a person makes over $50K a year or not.
agnews | GitHub |
allstate_claims_severity | Kaggle |
amazon_employee_access_challenge | Kaggle |
amazon_review_polarity | S3 |
amazon_reviews | S3 |
ames_housing | Kaggle |
bbc_news | Kaggle |
bnp_claims_management | Kaggle |
connect4 | Kaggle |
creditcard_fraud | Kaggle |
dbpedia | S3 |
electricity | S3 | Predict electricity demand from day of week and outside temperature.
ethos_binary | GitHub |
fever | S3 |
flickr8k | GitHub |
goemotions | GitHub |
ieee_fraud | Kaggle |
imbalanced_insurance | Kaggle |
imdb | Kaggle |
insurance_lite | Kaggle |
irony | GitHub |
noshow_appointments | Kaggle |
numerai28pt6 | Kaggle |
ohsumed_7400 | Kaggle |
otto_group_product | Kaggle |
porto_seguro_safe_driver | Kaggle |
reuters_r8 | Kaggle | Reuters R8 subset of Reuters 21578 dataset from Kaggle.
rossmann_store_sales | Kaggle |
santander_customer_satisfaction | Kaggle |
santander_customer_transaction_prediction | Kaggle |
santander_value_prediction | Kaggle |
sst3 | | Merging very negative and negative, and very positive and positive classes.
synthetic_fraud | Kaggle |
temperature | Kaggle |
titanic | Kaggle |
walmart_recruiting | Kaggle |
wmt15 | Kaggle |
yahoo_answers | S3 | Question classification.
yelp_review_polarity | S3 | Predict the polarity or sentiment of a Yelp review.
yelp_reviews | S3 |
yosemite | GitHub | Yosemite temperatures dataset.

Adding Datasets

To add a dataset to the Ludwig Dataset Zoo, see Add a Dataset.