Add a Dataset
Source code for datasets lives under ludwig/datasets/.
Adding a new Dataset¶
Subclass ludwig.datasets.base_dataset.BaseDataset and implement the following methods:
@abc.abstractmethod
def download_raw_dataset(self):
    raise NotImplementedError()

@abc.abstractmethod
def process_downloaded_dataset(self):
    raise NotImplementedError()

@abc.abstractmethod
def load_processed_dataset(self, split):
    raise NotImplementedError()
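As a rough illustration of how a subclass might fill in these three methods, here is a minimal, self-contained sketch. It does not import Ludwig: the `BaseDataset` class below is a stand-in that reproduces only the three abstract methods shown above, and `ToyDataset`, `load_toy_dataset`, the file names, the `split` column values, and the treatment of `split` as a boolean are all hypothetical choices made for the example, not the library's actual behavior.

```python
import abc
import csv
import os
import tempfile


class BaseDataset(abc.ABC):
    """Stand-in for ludwig.datasets.base_dataset.BaseDataset (assumption:
    only the three abstract methods from the text are reproduced here)."""

    @abc.abstractmethod
    def download_raw_dataset(self):
        raise NotImplementedError()

    @abc.abstractmethod
    def process_downloaded_dataset(self):
        raise NotImplementedError()

    @abc.abstractmethod
    def load_processed_dataset(self, split):
        raise NotImplementedError()


class ToyDataset(BaseDataset):
    """Hypothetical dataset that 'downloads' two rows into a local directory."""

    def __init__(self, cache_dir):
        self.raw_path = os.path.join(cache_dir, "raw.txt")
        self.processed_path = os.path.join(cache_dir, "processed.csv")

    def download_raw_dataset(self):
        # A real dataset would fetch a remote file; here we write one locally.
        with open(self.raw_path, "w") as f:
            f.write("good movie|1|0\nbad movie|0|2\n")

    def process_downloaded_dataset(self):
        # Convert the pipe-delimited raw file into a CSV with a header row.
        with open(self.raw_path) as src, \
                open(self.processed_path, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(["text", "label", "split"])
            for line in src:
                writer.writerow(line.strip().split("|"))

    def load_processed_dataset(self, split):
        with open(self.processed_path, newline="") as f:
            rows = list(csv.DictReader(f))
        if split:
            # Assumed convention for this sketch: "0" = train, "2" = test.
            train = [r for r in rows if r["split"] == "0"]
            test = [r for r in rows if r["split"] == "2"]
            return train, test
        return rows


def load_toy_dataset(cache_dir, split=False):
    """Run the three steps in order and return the loaded rows."""
    dataset = ToyDataset(cache_dir)
    dataset.download_raw_dataset()
    dataset.process_downloaded_dataset()
    return dataset.load_processed_dataset(split)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        train, test = load_toy_dataset(cache_dir, split=True)
        print(len(train), len(test))
```

Keeping download, processing, and loading in separate methods means each step can be tested or re-run on its own, which is presumably why the base class splits them this way.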
For common steps (e.g., extracting zip files or downloading from Kaggle), a set of mixins is available for your subclass:
- Mixin
- Mixin
Mixin properties for a specific dataset are configurable in the config.yaml file in the new dataset module.
These mixins cover the most common functionality a dataset subclass needs, but new mixins can also be added.
Before adding a new mixin or writing code to download, process, and load a new dataset, please check whether you can reuse one of the current mixins.
If not, consider adding a new mixin when the functionality you need is common to multiple datasets; otherwise, implement bespoke code in the abstract methods of your BaseDataset subclass.
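To make the mixin idea concrete, the sketch below shows one way a reusable step could be factored out and driven by per-dataset settings. Everything in it is an assumption for illustration: `ZipExtractMixin`, `ToyZippedDataset`, and the `archive_filename` / `csv_filename` keys are hypothetical names, and a plain dict stands in for values that would come from config.yaml. It is not the implementation of Ludwig's own mixins.

```python
import os
import tempfile
import zipfile


class ZipExtractMixin:
    """Hypothetical mixin: unpacks the archive named in self.config.

    Expects the host class to provide `self.raw_dir` and `self.config`.
    """

    def extract_archive(self):
        archive_name = self.config["archive_filename"]
        archive_path = os.path.join(self.raw_dir, archive_name)
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(self.raw_dir)
        # Report the extracted files, excluding the archive itself.
        return sorted(
            name for name in os.listdir(self.raw_dir) if name != archive_name
        )


class ToyZippedDataset(ZipExtractMixin):
    """Hypothetical dataset whose processing step is supplied by the mixin."""

    def __init__(self, raw_dir, config):
        self.raw_dir = raw_dir
        self.config = config  # stand-in for values read from config.yaml

    def download_raw_dataset(self):
        # Stand-in for a real download: build a small zip archive locally.
        path = os.path.join(self.raw_dir, self.config["archive_filename"])
        with zipfile.ZipFile(path, "w") as zf:
            zf.writestr(self.config["csv_filename"], "text,label\nhello,1\n")

    def process_downloaded_dataset(self):
        # The dataset-specific code only delegates to the shared mixin step.
        return self.extract_archive()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as raw_dir:
        config = {"archive_filename": "toy.zip", "csv_filename": "toy.csv"}
        dataset = ToyZippedDataset(raw_dir, config)
        dataset.download_raw_dataset()
        print(dataset.process_downloaded_dataset())
```

Because the mixin reads file names from the configuration rather than hard-coding them, a second dataset with a different archive could reuse the same extraction step by changing only its settings.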
Please model the unit tests for your dataset on the existing ones. Before submitting a new dataset, test it locally, following the existing examples: load your dataset, split it, and call the Ludwig training API to ensure everything runs correctly.
For datasets hosted on Kaggle, refer to the Python Kaggle Client to see how to configure your Kaggle credentials locally so that datasets can be downloaded. Also see the internals of the Kaggle DownloadMixin.
Unit Tests for the Datasets API¶
The easiest way to see how to extend the Datasets API is to look at the dataset-related unit tests: