Supported Formats
File Formats
Ludwig can read UTF-8 encoded data from 14 file formats. Supported formats are:
- Comma Separated Values (csv)
- Excel Workbooks (excel)
- Feather (feather)
- Fixed Width Format (fwf)
- Hierarchical Data Format 5 (hdf5)
- Hypertext Markup Language (html). Note: limited to a single table per file.
- JavaScript Object Notation (json and jsonl)
- Parquet (parquet)
- Pickled Pandas DataFrame (pickle)
- SAS data sets in XPORT or SAS7BDAT format (sas)
- SPSS file (spss)
- Stata file (stata)
- Tab Separated Values (tsv)
Ludwig uses Pandas and Dask under the hood to read the UTF-8 encoded dataset files, which allows support for CSV, Excel, Feather, fwf, HDF5, HTML (containing a <table>), JSON, JSONL, Parquet, pickle (pickled Pandas DataFrame), SAS, SPSS, Stata and TSV formats.
Ludwig tries to automatically identify the format by the extension.
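As a rough illustration of extension-based detection, the sketch below maps a file suffix to one of the format names listed above. The `EXTENSION_TO_FORMAT` table and the `guess_format` helper are hypothetical and are not part of Ludwig's API; the set of extensions Ludwig actually recognizes may differ.

```python
from pathlib import Path

# Hypothetical extension-to-format table, for illustration only.
# The extensions Ludwig actually recognizes may differ.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".xlsx": "excel",
    ".feather": "feather",
    ".fwf": "fwf",
    ".hdf5": "hdf5",
    ".html": "html",
    ".json": "json",
    ".jsonl": "jsonl",
    ".parquet": "parquet",
    ".pickle": "pickle",
    ".sas7bdat": "sas",
    ".sav": "spss",
    ".dta": "stata",
}


def guess_format(path):
    """Return a format name based on the file extension, or None if unknown."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    return EXTENSION_TO_FORMAT.get(suffix)


print(guess_format("data/train.CSV"))
print(guess_format("reviews.jsonl"))
print(guess_format("notes.txt"))
```

When the extension is missing or ambiguous, renaming the file to a conventional extension is the simplest way to make this kind of lookup succeed.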
When a delimiter-separated (*SV) file is provided, Ludwig tries to infer the separator (usually ,) from the data.
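One way to infer a separator from the data itself is Python's standard `csv.Sniffer`. The sketch below only illustrates the idea and is not necessarily how Ludwig does it; `detect_separator` and its `candidates` argument are hypothetical names.

```python
import csv


def detect_separator(sample, candidates=",\t;|"):
    """Guess the column separator of a text sample.

    `candidates` restricts the guess to a small set of plausible separators.
    csv.Sniffer raises csv.Error if it cannot decide.
    """
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


comma_sample = "name,age,city\nAlice,30,Paris\nBob,25,Rome\n"
tab_sample = "name\tage\tcity\nAlice\t30\tParis\nBob\t25\tRome\n"

print(repr(detect_separator(comma_sample)))
print(repr(detect_separator(tab_sample)))
```

Sniffing is a heuristic: it works best when every row contains the same number of separators, so a sample of several complete rows is preferable to a single line.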
The default escape character is \.
For example, if , is the column separator and one of your data columns has a , in it, Pandas would fail to load the data properly.
To handle such cases, we expect the values in the columns to be escaped with backslashes (replace , in the data with \,).
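The escaping convention can be sketched with Python's standard `csv` module, used here only to keep the example dependency-free. `pandas.read_csv` accepts an `escapechar` parameter as well, but whether Ludwig passes it exactly this way is an assumption.

```python
import csv
import io

rows = [["id", "text"], ["1", "hello, world"]]

# Write without quoting, so commas inside values are escaped with a backslash.
buffer = io.StringIO()
writer = csv.writer(
    buffer, quoting=csv.QUOTE_NONE, escapechar="\\", lineterminator="\n"
)
writer.writerows(rows)
escaped = buffer.getvalue()
print(escaped)

# Read it back, treating the backslash as the escape character.
reader = csv.reader(io.StringIO(escaped), quoting=csv.QUOTE_NONE, escapechar="\\")
parsed = list(reader)
print(parsed)
```

The written text contains `hello\, world`, and reading it back with the same escape character restores the original two-column rows.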
Hugging Face Datasets
Ludwig also supports loading Hugging Face datasets directly with the following syntax. Not every Hugging Face dataset has a dataset_subset, so omit it (together with the -- separator) when it does not apply.
"hf://{dataset_name}--{dataset_subset}"
For example:
train_stats, _, _ = ludwig_model.train(dataset="hf://mbpp")
train_stats, _, _ = ludwig_model.train(dataset="hf://Open-Orca/OpenOrca")
train_stats, _, _ = ludwig_model.train(dataset="hf://gsm8k--main")
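To make the syntax concrete, the sketch below splits such a string into a dataset name and an optional subset. `parse_hf_uri` is a hypothetical helper written for illustration; it is not Ludwig's own parser.

```python
def parse_hf_uri(uri):
    """Split "hf://{dataset_name}--{dataset_subset}" into (name, subset).

    Returns subset=None when the "--{dataset_subset}" part is absent.
    Hypothetical helper, not part of Ludwig's API.
    """
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a string starting with {prefix!r}: {uri!r}")
    # A single hyphen (as in "Open-Orca") is part of the name; only a
    # double hyphen separates the subset.
    name, separator, subset = uri[len(prefix):].partition("--")
    return name, (subset if separator else None)


print(parse_hf_uri("hf://mbpp"))
print(parse_hf_uri("hf://Open-Orca/OpenOrca"))
print(parse_hf_uri("hf://gsm8k--main"))
```

In the Hugging Face `datasets` library, the subset corresponds to the configuration name passed as the second argument of `load_dataset`, while the split is a separate `split` argument.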
Please note that "subset" is not the same as "split". Make sure that you are including the subset name and not the split name when specifying the dataset:
