Dataset Zoo
The Datasets module provides training datasets that can be directly plugged into a Ludwig model.
Datasets can be accessed programmatically by importing the ludwig.datasets module.
Each dataset class in the module has a download, a process, and a load function, plus a handy module-level load function that can be called on the dataset module itself (e.g. datasets.titanic.load).
Calling the module-level load() function handles downloading, preprocessing, and loading the dataset into a Pandas DataFrame that Ludwig can use for training.
A cache_dir parameter specifies the directory the files are downloaded to (by default ~/.ludwig_cache), and each dataset can accept additional parameters specific to the way it is structured (e.g. a split parameter that returns separate DataFrames for the different splits of the data, if applicable).
When load() is called, it first checks whether a raw dataset directory exists and calls download() if the data has not been downloaded yet. It then checks whether a processed dataset directory exists and calls process() if the data has not been processed yet. Finally, the processed data is loaded into memory.
For example:
from ludwig.datasets import reuters
dataset = reuters.load()
This is equivalent to performing the steps manually:
from ludwig.datasets.reuters import Reuters
dataset = Reuters()
dataset.download()
dataset.process()
dataset_df = dataset.load()
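The check-then-run behaviour of load() described above can be sketched with a small stand-in class. This is an illustration only, not Ludwig's implementation: ToyDataset, its directory layout, and the calls list are hypothetical, and the sketch returns a list of dictionaries instead of a Pandas DataFrame to stay dependency-free.

```python
import tempfile
from pathlib import Path


class ToyDataset:
    """Hypothetical stand-in for a dataset class (not Ludwig's code)."""

    def __init__(self, cache_dir):
        root = Path(cache_dir).expanduser() / "toy"
        self.raw_dir = root / "raw"
        self.processed_dir = root / "processed"
        self.calls = []  # records which steps actually ran

    def download(self):
        # A real dataset would fetch files here; we write a tiny CSV instead.
        self.raw_dir.mkdir(parents=True, exist_ok=True)
        (self.raw_dir / "data.csv").write_text("x,y\n1,2\n3,4\n")
        self.calls.append("download")

    def process(self):
        # A real dataset would clean and reshape the raw files here.
        self.processed_dir.mkdir(parents=True, exist_ok=True)
        raw = (self.raw_dir / "data.csv").read_text()
        (self.processed_dir / "data.csv").write_text(raw)
        self.calls.append("process")

    def load(self):
        # Run only the steps whose output directory is missing.
        if not self.raw_dir.exists():
            self.download()
        if not self.processed_dir.exists():
            self.process()
        lines = (self.processed_dir / "data.csv").read_text().splitlines()
        header, *rows = [line.split(",") for line in lines]
        return [dict(zip(header, row)) for row in rows]


with tempfile.TemporaryDirectory() as tmp:
    first = ToyDataset(tmp)
    rows = first.load()          # nothing cached: downloads, then processes
    second = ToyDataset(tmp)
    rows_again = second.load()   # both directories exist: only loads
    first_calls, second_calls = first.calls, second.calls
```

The second load() call skips both steps because the raw and processed directories already exist in the cache directory, which is why repeated calls with the same cache_dir are cheap.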
And here is an end-to-end example of training a model using the MNIST dataset:
from ludwig.api import LudwigModel
from ludwig.datasets import mnist
# Initialize a Ludwig model
# (`config` must be defined beforehand: a configuration dictionary or a path to a YAML file)
model = LudwigModel(config)
# Load and split MNIST dataset
training_set, test_set, _ = mnist.load(split=True)
# Run model training
train_stats, _, _ = model.train(
    training_set=training_set,
    test_set=test_set,
    model_name='mnist_model'
)
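The training example relies on a config object that it does not define. A minimal sketch of what such a configuration might look like follows. The column names image_path and label are assumptions about the MNIST DataFrame, not something stated here, so check them against training_set.columns before use.

```python
# Hypothetical minimal configuration for MNIST.
# ASSUMPTION: the DataFrame has an "image_path" column holding image file
# paths and a "label" column holding the digit class.
config = {
    "input_features": [
        {"name": "image_path", "type": "image"},
    ],
    "output_features": [
        {"name": "label", "type": "category"},
    ],
}
```

Encoder and training settings are left at their defaults in this sketch; a real configuration would likely tune them.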
Currently Available Datasets
Here is the list of the currently available datasets:
adult_census_income
agnews
amazon_review_polarity
amazon_reviews
ames_housing (hosted on Kaggle)
dbpedia
electricity
ethos_binary
fever
flickr8k
forest_cover
goemotions
higgs
irony
kdd_appetency
kdd_churn
kdd_upselling
mnist
mushroom_edibility
ohsumed
poker_hand
reuters
rossmann_store_sales (hosted on Kaggle)
sarcos
sst2, sst5, sst3 (sst3 is a variant obtained by merging the very negative and negative classes, and the very positive and positive classes)
temperature (hosted on Kaggle)
titanic (hosted on Kaggle)
yahoo_answers
yelp_review_polarity
yelp_reviews
yosemite
To download the datasets hosted on Kaggle, you can either provide credentials through the kaggle_username and kaggle_key parameters of the load() function, or follow the more secure instructions provided in the Python Kaggle Client documentation.
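One way to avoid hard-coding credentials when using the kaggle_username and kaggle_key parameters is to read them from the environment. The sketch below assumes they are stored in the KAGGLE_USERNAME and KAGGLE_KEY environment variables (the names the Kaggle client itself recognizes). The helper names kaggle_credentials_from_env and load_titanic are hypothetical and not part of Ludwig.

```python
import os


def kaggle_credentials_from_env(env=os.environ):
    """Build keyword arguments for load() from environment variables.

    Hypothetical helper: reads KAGGLE_USERNAME and KAGGLE_KEY and fails
    loudly if either is missing or empty.
    """
    missing = [name for name in ("KAGGLE_USERNAME", "KAGGLE_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing Kaggle credentials: " + ", ".join(missing))
    return {
        "kaggle_username": env["KAGGLE_USERNAME"],
        "kaggle_key": env["KAGGLE_KEY"],
    }


def load_titanic():
    # Imported lazily so the helper above can be used without Ludwig installed.
    from ludwig.datasets import titanic

    return titanic.load(**kaggle_credentials_from_env())
```

Calling load_titanic() then passes the credentials to load() without them ever appearing in the source file.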