Dataset Zoo
The Ludwig Dataset Zoo provides datasets that can be directly plugged into a Ludwig model.
The simplest way to use a dataset is to reference it as a URI when specifying the training set:
ludwig train --dataset ludwig://reuters ...
Any Ludwig dataset can be specified as a URI of the form ludwig://<dataset>.
Datasets can also be programmatically imported and loaded into a Pandas DataFrame using the .load() method:
from ludwig.datasets import reuters
# Loads into single dataframe with a 'split' column:
dataset_df = reuters.load()
# Loads into split dataframes:
train_df, test_df, _ = reuters.load(split=True)
The ludwig.datasets API also provides functions to list, describe, and get datasets. For example:
import ludwig.datasets
# Gets a list of all available dataset names.
dataset_names = ludwig.datasets.list_datasets()
# Prints the description of the titanic dataset.
print(ludwig.datasets.describe_dataset("titanic"))
# Gets the titanic dataset.
titanic = ludwig.datasets.get_dataset("titanic")
# Loads into single dataframe with a 'split' column:
dataset_df = titanic.load()
# Loads into split dataframes:
train_df, test_df, _ = titanic.load(split=True)
Kaggle Datasets
Some datasets are hosted on Kaggle and require a Kaggle account. To use these, you'll need to set up Kaggle credentials in your environment. If the dataset is part of a Kaggle competition, you'll also need to accept the terms on the competition page.
To check programmatically whether a dataset is hosted on Kaggle, use its .is_kaggle_dataset property.
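For example, credentials can be supplied through the environment variables read by the official Kaggle API client (a minimal sketch; the username and key below are placeholders):
import os
# Placeholder credentials; replace with your own Kaggle username and API key.
os.environ["KAGGLE_USERNAME"] = "my_username"
os.environ["KAGGLE_KEY"] = "my_api_key"
import ludwig.datasets
# titanic is a Kaggle competition dataset, so this should print True.
titanic = ludwig.datasets.get_dataset("titanic")
print(titanic.is_kaggle_dataset)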
Downloading, Processing, and Exporting
Datasets are first downloaded into LUDWIG_CACHE, which may be set as an environment variable and defaults to $HOME/.ludwig_cache.
Datasets are automatically loaded, processed, and re-saved as parquet files in the cache.
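For example, the cache location can be redirected before any dataset is loaded (a sketch; the directory path is a placeholder, and this assumes Ludwig reads the variable at load time):
import os
# Hypothetical cache directory; set before loading any dataset.
os.environ["LUDWIG_CACHE"] = "/data/ludwig_cache"
from ludwig.datasets import reuters
dataset_df = reuters.load()  # downloads and processes into the custom cache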
To export the processed dataset, including any files it depends on, use the .export(output_directory) method. This is recommended if the dataset contains media files like images or audio files. File paths are relative to the working directory of the training process.
from ludwig.datasets import twitter_bots
# Exports twitter bots dataset and image files to the current working directory.
twitter_bots.export(".")
End-to-end Example
Here's an end-to-end example of training a model using the MNIST dataset:
from ludwig.api import LudwigModel
from ludwig.datasets import mnist
# Initializes a Ludwig model
config = {
"input_features": [{"name": "image_path", "type": "image"}],
"output_features": [{"name": "label", "type": "category"}],
}
model = LudwigModel(config)
# Loads and splits MNIST dataset
training_set, test_set, _ = mnist.load(split=True)
# Exports the mnist image files to the current working directory.
mnist.export(".")
# Runs model training
train_stats, _, _ = model.train(training_set=training_set, test_set=test_set, model_name="mnist_model")
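After training, the held-out test split can be scored with the same API (a sketch; this assumes the current LudwigModel.evaluate signature, which returns evaluation statistics, predictions, and the output directory):
# Evaluates the trained model on the test split.
eval_stats, predictions, output_dir = model.evaluate(test_set)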
Dataset Splits
All datasets in the dataset zoo are provided with a default train/validation/test split. When loading with split=False, the default split will be returned (and is guaranteed to be the same every time). With split=True, Ludwig will randomly re-split the dataset.
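For instance, with the reuters dataset from above (assuming, consistent with the earlier examples, that the third return value is the validation split):
from ludwig.datasets import reuters
# split=False (the default): one dataframe whose 'split' column encodes the
# deterministic default split.
dataset_df = reuters.load(split=False)
# split=True: three dataframes from a random re-split.
train_df, test_df, val_df = reuters.load(split=True)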
Note
Some benchmark or contest datasets are released with the test set labels held out. In other words, the train and validation splits have labels, but the test set does not. Most Kaggle contest datasets have this unlabeled test set.
Splits:
- train: Data to train on. Required, must have labels.
- validation: Subset of dataset to evaluate while training. Optional, must have labels.
- test: Held out from model development, used for later testing. Optional, may not be labeled.
Zoo Datasets
Here is the list of currently available datasets:
Adding datasets
To add a dataset to the Ludwig Dataset Zoo, see Add a Dataset.