Skip to content

Dataset Zoo

The Datasets module provides training datasets that can be directly plugged into a Ludwig model. Datasets can be accessed programmatically by importing the ludwig.datasets module. Each dataset class in the module has a download, process and load function, plus a handy static load function to import from the module itself (i.e. datasets.titanic.load). Calling the module load() function will handle downloading, preprocessing, and loading the dataset into a Pandas DataFrame that Ludwig can use for training. A cache_dir parameter can be used to provide a directory where to download the files (by default it is ~/.ludwig_cache), and each dataset can have additional parameters specific to the way they are structured (i.e. a split parameter to return multiple DataFrames for the different splits of the data, if applicable). When load() is called, the existence of a raw dataset directory is determined and if the data has not yet been downloaded, download() is called, then the existence of a processed dataset directory is determined and if data has not yet been processed process() is called and finally the processed data is loaded in memory.

For example:

from ludwig.datasets import reuters
dataset = reuters.load()

Is equivalent to performing the steps manually:

from ludwig.datasets.reuters import Reuters
dataset = Reuters()
dataset.download()
dataset.process()
dataset_df = dataset.load()

And here is an end-to-end example of training a model using the MNIST dataset:

from ludwig.api import LudwigModel
from ludwig.datasets import mnist

# Initialize a Ludwig model
model = LudwigModel(config)

# Load and split MNIST dataset
training_set, test_set, _ = mnist.load(split=True)

# Run model training
train_stats, _, _ = model.train(
    training_set=training_set,
    test_set=test_set,
    model_name='mnist_model'
)

Currently Available Datasets

Here is the list of the currently available datasets:

  • adult_census_income
  • agnews
  • amazon_review_polarity
  • amazon_reviews
  • ames_housing (hosted on Kaggle)
  • dbpedia
  • electricity
  • ethos_binary
  • fever
  • flickr8k
  • forest_cover
  • goemotions
  • higgs
  • irony
  • kdd_appetency
  • kdd_churn
  • kdd_upselling
  • mnist
  • mushroom_edibility
  • ohsumed
  • poker_hand
  • reuters
  • rossmann_store_sales (hosted on Kaggle)
  • sarcos
  • sst2, sst5, sst3 (a variant obtained my merging very negative and negative, and very positive and positive classes)
  • temperature (hosted on Kaggle)
  • titanic (hosted on Kaggle)
  • yahoo_answers
  • yelp_review_polarity
  • yelp_reviews
  • yosemite

In order to download the datasets hosted on Kaggle, you can either provide credentials through a kaggle_username and kaggle_key parameter to the load() function, or follow the more secure instructions provided in the Python Kaggle Client documentations.