Add a Tokenizer
Tokenizers transform a text string into a sequence of tokens. During preprocessing, Ludwig calls the tokenizer on every row of each text column it is specified for. It then collects the unique tokens and assigns an integer index to each one. This ordered list of unique tokens is called the vocabulary; encoders use it to convert tokens to embeddings, and decoders use it to convert output predictions back to tokens.
A tokenizer is primarily responsible for splitting a string into a list of tokens, and may optionally perform other processing, usually to reduce the size of the vocabulary. Examples of processing a tokenizer may perform include the following (a sketch combining several of these steps appears after the list):
- Splitting on a delimiter, e.g. underscore "_" or comma ",".
- Removing punctuation characters.
- Removing stop words, for example "a", "an", "the" in English.
- Lemmatization: grouping inflected forms of the same word, e.g. "car", "cars", "car's", "cars'" -> "car".
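As a rough illustration (this is not Ludwig code, and the function and stop word set are purely illustrative), a tokenizer combining several of these steps might look like:

```python
import string
from typing import List

STOP_WORDS = {"a", "an", "the"}  # a tiny illustrative stop word set


def simple_tokenize(text: str) -> List[str]:
    # Lowercase, strip punctuation, split on whitespace, drop stop words.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOP_WORDS]


simple_tokenize("The cars' wheels, the car's doors")  # ['cars', 'wheels', 'cars', 'doors']
```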
A tokenizer, once registered, can be used to preprocess any text input column by specifying its name as the value of `tokenizer` in the `preprocessing` config dictionary:
```yaml
input_features:
  - name: title
    type: text
    preprocessing:
      tokenizer: <NEW_TOKENIZER>
```
Tokenizers are defined in `ludwig/utils/tokenizers.py`. To add a tokenizer, define a new subclass of `BaseTokenizer` and add it to the registry.
1. Add a new tokenizer class
Tokenizers may define an optional constructor which receives arguments from the config. Most tokenizers have no config parameters and do not need a constructor. For an example of a tokenizer which uses a parameter in its constructor, see `HFTokenizer`.
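As a minimal sketch of what a parameterized tokenizer could look like: `DelimiterTokenizer` and its `delimiter` argument are hypothetical, and this assumes the corresponding preprocessing parameter is forwarded to the constructor as a keyword argument, as with `HFTokenizer`. The class would be defined alongside `BaseTokenizer` in `ludwig/utils/tokenizers.py`:

```python
from typing import List


class DelimiterTokenizer(BaseTokenizer):
    def __init__(self, delimiter: str = "_", **kwargs):
        super().__init__()
        self.delimiter = delimiter  # hypothetical parameter taken from the config

    def __call__(self, text: str) -> List[str]:
        # Split the input string on the configured delimiter.
        return text.split(self.delimiter)
```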
The `__call__` method does the actual processing: it is called with a single string argument and is expected to return a list of strings. The tokenizer is called once for each example in the dataset during preprocessing. Preprocessed token sequences are cached on disk in .hdf5 files and re-used for training and validation, so the tokenizer is not called during training.
```python
from typing import List


class NewTokenizer(BaseTokenizer):
    def __init__(self, **kwargs):
        super().__init__()
        # Initialize any variables or state.

    def __call__(self, text: str) -> List[str]:
        # Placeholder: replace whitespace splitting with your tokenization logic.
        tokenized_text = text.split()
        return tokenized_text
```
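Because a tokenizer is just a callable, it can be exercised directly before being wired into a config. With the whitespace placeholder above:

```python
tokenizer = NewTokenizer()
print(tokenizer("hello world"))  # ['hello', 'world']
```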
2. Add the tokenizer to the registry
Tokenizer names are mapped to their implementations in the `tokenizer_registry` dictionary at the bottom of `ludwig/utils/tokenizers.py`.
```python
tokenizer_registry = {
    "characters": CharactersToListTokenizer,
    "space": SpaceStringToListTokenizer,
    ...
    "new_tokenizer": NewTokenizer,  # Add your tokenizer as a new entry in the registry.
}
```