Preprocessing
The top-level preprocessing section specifies:
- dataset splitting (train, validation, test)
- data balancing
- type-global preprocessing
Dataset Splitting
Split data into train, validation, and test.
By default, Ludwig looks for a column named split (case-sensitive), which is expected to consist of 3 possible values that correspond to the different datasets:
- 0: train
- 1: validation
- 2: test
If the data does not contain the split column, then the data is randomly split based on the splitting percentages defined by split_probabilities.
If force_split is true, the split column in the dataset is ignored and the dataset is randomly split based on the splitting percentages defined by split_probabilities.
Summary of parameters:
- force_split (default false): if true the split column in the dataset file is ignored and the dataset is randomly split. If false the split column is used if available.
- split_probabilities (default [0.7, 0.1, 0.2]): the proportion of the data to end up in training, validation and test, respectively. The three values must sum to 1.0.
- stratify (default null): if null the split is random, otherwise you can specify the name of a category feature and the split will be stratified on that feature.
preprocessing:
    force_split: false
    split_probabilities: [0.7, 0.1, 0.2]
    stratify: null
    category: {...}
    sequence: {...}
    text: {...}
    ...
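For instance, to stratify the random split on a category feature, the configuration could look like the following sketch, where label is a hypothetical category column in the dataset:
preprocessing:
    split_probabilities: [0.7, 0.1, 0.2]
    stratify: label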
Data Balancing
Users working with imbalanced datasets can specify an oversampling or undersampling parameter which will balance the data during preprocessing.
In this example, Ludwig will oversample the minority class to achieve a 50% representation in the overall dataset.
preprocessing:
    oversample_minority: 0.5
In this example, Ludwig will undersample the majority class to achieve a 70% representation in the overall dataset.
preprocessing:
    undersample_majority: 0.7
Data balancing is only supported for binary output features. Additionally, specifying both oversampling and undersampling parameters simultaneously is not supported.
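As a minimal end-to-end sketch, the balancing parameter sits in the top-level preprocessing section alongside the feature definitions. The feature names below (plan, country, churned) are hypothetical, and churned is assumed to be a binary output feature:
input_features:
    -
        name: plan
        type: category
    -
        name: country
        type: category
output_features:
    -
        name: churned
        type: binary
preprocessing:
    oversample_minority: 0.5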
Type-global Preprocessing
Specify preprocessing policies that apply globally across all features of a certain data type. For example:
preprocessing:
    category:
        missing_value_strategy: fill_with_const
        fill_value: <UNK>
The preprocessing parameters that each data type accepts can be found in datatype-specific documentation.
Note that different features with the same datatype may require different preprocessing. Type-global preprocessing works in tandem with feature-specific preprocessing configuration parameters, which override the global settings.
For example, a document classification model may have two text input features, one for the title of the document and one for the body.
As the title is much shorter than the body, the parameter word_length_limit should be set to 20 for the title and 2000 for the body, but we want both features to share the same vocabulary, with most_common_words: 10000.
The way to do this is to add a preprocessing key inside the title input feature dictionary and another inside the body input feature dictionary, each containing the desired parameter and value.
preprocessing:
    text:
        most_common_words: 10000
input_features:
    -
        name: title
        type: text
        preprocessing:
            word_length_limit: 20
    -
        name: body
        type: text
        preprocessing:
            word_length_limit: 2000
Feature-specific preprocessing
To configure feature-specific preprocessing, please check datatype-specific documentation.
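As an illustration, the sketch below combines a type-global default with a feature-specific override. The feature name product_category is hypothetical, and fill_with_mode is assumed to be one of the available missing value strategies for category features:
preprocessing:
    category:
        missing_value_strategy: fill_with_const
        fill_value: <UNK>
input_features:
    -
        name: product_category
        type: category
        preprocessing:
            missing_value_strategy: fill_with_mode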
Tokenizers
Sequence, text, and set features are tokenized as part of preprocessing. There are several tokenization options that can be specified (a configuration sketch follows the list below):
- characters: splits every character of the input string into a separate token.
- space: splits on space characters using the regex \s+.
- space_punct: splits on space characters and punctuation using the regex \w+|[^\w\s].
- underscore: splits on the underscore character _.
- comma: splits on the comma character ,.
- untokenized: treats the whole string as a single token.
- stripped: treats the whole string as a single token after removing spaces at the beginning and at the end of the string.
- hf_tokenizer: uses the Hugging Face AutoTokenizer, which uses a pretrained_model_name_or_path parameter to decide which tokenizer to load.
- Language specific tokenizers: spaCy based language tokenizers (described below).
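As a sketch, the tokenization option is selected in the preprocessing section of the feature. The feature names below are hypothetical, tokenizer is assumed to be the relevant preprocessing parameter name (check the datatype-specific documentation for your version), and bert-base-uncased is just an example Hugging Face model:
input_features:
    -
        name: title
        type: text
        preprocessing:
            tokenizer: space_punct
    -
        name: body
        type: text
        preprocessing:
            tokenizer: hf_tokenizer
            pretrained_model_name_or_path: bert-base-uncased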
The spaCy based tokenizers are functions that use the powerful tokenization and NLP preprocessing models provided by the library.
Several languages are available: English (code en), Italian (code it), Spanish (code es), German (code de), French (code fr), Portuguese (code pt), Dutch (code nl), Greek (code el), Chinese (code zh), Danish (code da), Japanese (code ja), Lithuanian (code lt), Norwegian (code nb), Polish (code pl), Romanian (code ro) and Multi (code xx, useful in case you have a dataset containing different languages).
For each language different functions are available:
- tokenize: uses the spaCy tokenizer,
- tokenize_filter: uses the spaCy tokenizer and filters out punctuation, numbers, stopwords and words shorter than 3 characters,
- tokenize_remove_stopwords: uses the spaCy tokenizer and filters out stopwords,
- lemmatize: uses the spaCy lemmatizer,
- lemmatize_filter: uses the spaCy lemmatizer and filters out punctuation, numbers, stopwords and words shorter than 3 characters,
- lemmatize_remove_stopwords: uses the spaCy lemmatizer and filters out stopwords.
In order to use these options, you must first download the spaCy model:
python -m spacy download <language_code>
and provide <language>_<function> as the tokenizer, for example english_tokenize, italian_lemmatize_filter, multi_tokenize_filter and so on.
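For example, after running python -m spacy download it, a hypothetical review text feature could use the Italian lemmatizing tokenizer as in this sketch (same assumption about the tokenizer parameter name as above):
input_features:
    -
        name: review
        type: text
        preprocessing:
            tokenizer: italian_lemmatize_filter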
More details on the models can be found in the spaCy documentation.