Text Features
Text Features Preprocessing¶
Text features are treated in the same way as sequence features, with a couple of differences. Two different tokenizations are performed, one that splits at every character and one that splits on whitespace and punctuation, and two different keys are added to the HDF5 file, one containing the matrix of characters and one containing the matrix of words. The same thing happens in the JSON file, which contains dictionaries for mapping characters to integers (and the inverse) and words to integers (and the inverse). In the configuration you can specify which level of representation to use: the character level or the word level.
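As a rough illustration of the two tokenizations and of the mappings stored in the JSON file, the sketch below splits a string at every character and on whitespace and punctuation, then builds integer mappings in both directions. The regular expression and the helper names (`char_tokenize`, `word_tokenize`, `build_vocab`) are assumptions made for this example and are not Ludwig's actual implementation; only the reserved IDs for `<PAD>` and `<UNK>` follow the defaults described in this section.

```python
import re
from collections import Counter


def char_tokenize(text):
    # Character level: split the string at every character.
    return list(text)


def word_tokenize(text):
    # Word level (assumed approximation of a whitespace-and-punctuation split):
    # runs of word characters and single punctuation marks become tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(token_lists, most_common):
    # IDs 0 and 1 are reserved for the padding and unknown symbols;
    # the remaining IDs go to the most frequent tokens.
    counts = Counter(tok for toks in token_lists for tok in toks)
    idx2str = ["<PAD>", "<UNK>"] + [tok for tok, _ in counts.most_common(most_common)]
    str2idx = {tok: i for i, tok in enumerate(idx2str)}
    return idx2str, str2idx


texts = ["Hello, world!", "Hello again."]
word_idx2str, word_str2idx = build_vocab([word_tokenize(t) for t in texts], most_common=20000)
char_idx2str, char_str2idx = build_vocab([char_tokenize(t) for t in texts], most_common=70)

print(word_tokenize(texts[0]))
print([word_str2idx[w] for w in word_tokenize(texts[0])])
print([char_str2idx[c] for c in char_tokenize(texts[0])])
```

Both levels are computed from the same raw string, which is why the level of representation can be chosen later in the configuration without preprocessing the data again.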
The parameters available for preprocessing are:
char_tokenizer (default characters): defines how to map from the raw string content of the dataset column to a sequence of characters. The default value and only available option is characters, and the behavior is to split the string at each character.
char_vocab_file (default null)
char_sequence_length_limit (default 1024): the maximum length of the text in characters. Texts that are longer than this value will be truncated, while texts that are shorter will be padded.
char_most_common (default 70): the maximum number of most common characters to be considered. If the data contains more than this amount, the most infrequent characters will be treated as unknown.
word_tokenizer (default space_punct): defines how to map from the raw string content of the dataset column to a sequence of elements. For the available options refer to the Tokenizers section.
pretrained_model_name_or_path (default null)
word_vocab_file (default null)
word_sequence_length_limit (default 256): the maximum length of the text in words. Texts that are longer than this value will be truncated, while texts that are shorter will be padded.
word_most_common (default 20000): the maximum number of most common words to be considered. If the data contains more than this amount, the most infrequent words will be treated as unknown.
padding_symbol (default <PAD>): the string used as a padding symbol. It is mapped to the integer ID 0 in the vocabulary.
unknown_symbol (default <UNK>): the string used as an unknown symbol. It is mapped to the integer ID 1 in the vocabulary.
padding (default right): the direction of the padding. right and left are the available options.
lowercase (default false): whether the string has to be lowercased before being handled by the tokenizer.
missing_value_strategy (default fill_with_const): what strategy to follow when there's a missing value in a text column. The value should be one of fill_with_const (replaces the missing value with the value specified by the fill_value parameter), fill_with_mode (replaces the missing values with the most frequent value in the column), fill_with_mean (replaces the missing values with the mean of the values in the column), backfill (replaces the missing values with the next valid value).
fill_value (default ""): the value to replace the missing values with when missing_value_strategy is fill_with_const.
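To make the interaction between the length limit, the padding direction and the unknown symbol concrete, here is a minimal sketch. The `encode` helper is hypothetical and not part of Ludwig; it assumes that truncation keeps the beginning of the text and that every sequence is padded up to the limit, both of which are simplifications.

```python
def encode(tokens, str2idx, sequence_length_limit, padding="right",
           padding_symbol="<PAD>", unknown_symbol="<UNK>"):
    # Tokens that are not in the vocabulary (for example because they fall
    # outside the most common ones) are mapped to the unknown symbol.
    ids = [str2idx.get(tok, str2idx[unknown_symbol]) for tok in tokens]
    # Longer sequences are truncated (assumption: the beginning is kept).
    ids = ids[:sequence_length_limit]
    # Shorter sequences are padded on the requested side.
    pad = [str2idx[padding_symbol]] * (sequence_length_limit - len(ids))
    return ids + pad if padding == "right" else pad + ids


str2idx = {"<PAD>": 0, "<UNK>": 1, "the": 2, "cat": 3}
tokens = ["the", "cat", "sat"]

print(encode(tokens, str2idx, sequence_length_limit=5))
print(encode(tokens, str2idx, sequence_length_limit=5, padding="left"))
print(encode(tokens, str2idx, sequence_length_limit=2))
```

The same logic applies at both levels: only the tokens, the vocabulary and the limit (char_sequence_length_limit or word_sequence_length_limit) change.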
Example of text preprocessing.
name: text_column_name
type: text
level: word
preprocessing:
    char_tokenizer: characters
    char_vocab_file: null
    char_sequence_length_limit: 1024
    char_most_common: 70
    word_tokenizer: space_punct
    pretrained_model_name_or_path: null
    word_vocab_file: null
    word_sequence_length_limit: 256
    word_most_common: 20000
    padding_symbol: <PAD>
    unknown_symbol: <UNK>
    padding: right
    lowercase: false
    missing_value_strategy: fill_with_const
    fill_value: ""
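The missing_value_strategy options listed among the preprocessing parameters can be pictured on a plain list of strings. This is only a sketch: `fill_missing` is a hypothetical helper, `None` stands in for a missing value, and fill_with_mean is left out because a mean is not defined for raw strings.

```python
from collections import Counter


def fill_missing(values, strategy="fill_with_const", fill_value=""):
    # `None` marks a missing value in this sketch.
    if strategy == "fill_with_const":
        return [fill_value if v is None else v for v in values]
    if strategy == "fill_with_mode":
        # Most frequent non-missing value in the column.
        mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
        return [mode if v is None else v for v in values]
    if strategy == "backfill":
        # Walk the column backwards so each gap takes the next valid value.
        filled, next_valid = [], None
        for v in reversed(values):
            if v is not None:
                next_valid = v
            filled.append(next_valid)
        return filled[::-1]
    raise ValueError(f"unsupported strategy: {strategy}")


column = ["good movie", None, "bad movie", "good movie", None]
print(fill_missing(column))
print(fill_missing(column, strategy="fill_with_mode"))
print(fill_missing(column, strategy="backfill"))
```

Note that in this sketch backfill leaves a trailing missing value untouched, since there is no later valid value to copy.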
Text Input Features and Encoders¶
Text input feature parameters are:
encoder (default parallel_cnn): encoder to use for the input text feature. The available encoders are those of Sequence Features plus these text-specific encoders: bert, gpt, gpt2, xlnet, xlm, roberta, distilbert, ctrl, camembert, albert, t5, xlmroberta, flaubert, electra, longformer and auto_transformer.
level (default word): word uses the words of the text, char uses individual characters.
tied_weights (default null): name of the input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.
BERT Encoder¶
The bert encoder loads a pretrained BERT model using the Hugging Face transformers package.
pretrained_model_name_or_path (default bert-base-uncased): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
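The reduce_output values shared by the encoders in this section can be pictured on a single example, that is, a list of s vectors of size h. The sketch below uses plain Python lists and a hypothetical `reduce_sequence` helper. It ignores padding and does not cover cls_pooled, which depends on the pooled output of the pretrained model, so it should be read as an illustration of the shapes involved rather than as the encoders' implementation.

```python
def reduce_sequence(hidden, mode):
    # `hidden` holds one example: a list of s vectors, each of size h.
    if mode is None:
        return hidden                                   # no reduction: [s, h]
    if mode == "sum":
        return [sum(col) for col in zip(*hidden)]       # [h]
    if mode in ("mean", "avg"):
        return [sum(col) / len(hidden) for col in zip(*hidden)]  # [h]
    if mode == "max":
        return [max(col) for col in zip(*hidden)]       # [h]
    if mode == "concat":
        return [x for vec in hidden for x in vec]       # [s * h]
    if mode == "last":
        return hidden[-1]                               # [h]
    raise ValueError(f"unsupported reduce_output: {mode}")


hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]  # s = 3, h = 2
for mode in ("sum", "mean", "max", "concat", "last", None):
    print(mode, reduce_sequence(hidden, mode))
```

All reductions except concat and null return a vector whose size does not depend on the sequence length, which is usually what the layers after the encoder expect.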
GPT Encoder¶
The gpt encoder loads a pretrained GPT model using the Hugging Face transformers package.
pretrained_model_name_or_path (default openai-gpt): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
GPT-2 Encoder¶
The gpt2 encoder loads a pretrained GPT-2 model using the Hugging Face transformers package.
pretrained_model_name_or_path (default gpt2): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
XLNet Encoder¶
The xlnet encoder loads a pretrained XLNet model using the Hugging Face transformers package.
pretrained_model_name_or_path (default xlnet-base-cased): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
XLM Encoder¶
The xlm encoder loads a pretrained XLM model using the Hugging Face transformers package.
pretrained_model_name_or_path (default xlm-mlm-en-2048): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
RoBERTa Encoder¶
The roberta encoder loads a pretrained RoBERTa model using the Hugging Face transformers package.
pretrained_model_name_or_path (default roberta-base): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
DistilBERT Encoder¶
The distilbert encoder loads a pretrained DistilBERT model using the Hugging Face transformers package.
pretrained_model_name_or_path (default distilbert-base-uncased): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
CTRL Encoder¶
The ctrl encoder loads a pretrained CTRL model using the Hugging Face transformers package.
pretrained_model_name_or_path (default ctrl): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
CamemBERT Encoder¶
The camembert encoder loads a pretrained CamemBERT model using the Hugging Face transformers package.
pretrained_model_name_or_path (default jplu/tf-camembert-base): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
ALBERT Encoder¶
The albert encoder loads a pretrained ALBERT model using the Hugging Face transformers package.
pretrained_model_name_or_path (default albert-base-v2): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
T5 Encoder¶
The t5 encoder loads a pretrained T5 model using the Hugging Face transformers package.
pretrained_model_name_or_path (default t5-small): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
XLM-RoBERTa Encoder¶
The xlmroberta encoder loads a pretrained XLM-RoBERTa model using the Hugging Face transformers package.
pretrained_model_name_or_path (default jplu/tf-xlm-roberta-base): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
FlauBERT Encoder¶
The flaubert encoder loads a pretrained FlauBERT model using the Hugging Face transformers package.
pretrained_model_name_or_path (default jplu/tf-flaubert-base-uncased): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
ELECTRA Encoder¶
The electra encoder loads a pretrained ELECTRA model using the Hugging Face transformers package.
pretrained_model_name_or_path (default google/electra-small-discriminator): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
Longformer Encoder¶
The longformer encoder loads a pretrained Longformer model using the Hugging Face transformers package.
pretrained_model_name_or_path (default allenai/longformer-base-4096): either the name of a model or a path where it was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
Auto-Transformer Encoder¶
The auto_transformer encoder loads a pretrained model using the Hugging Face transformers package.
It is the best option for custom-trained models that don't fit into the other pretrained transformer encoders.
pretrained_model_name_or_path: either the name of a model or a path where it was downloaded. For details on the available models, refer to the Hugging Face documentation.
reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
Example usage¶
Example text input feature encoder usage:
name: text_column_name
type: text
level: word
encoder: bert
tied_weights: null
pretrained_model_name_or_path: bert-base-uncased
reduce_output: cls_pooled
trainable: false
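Since the encoders above differ only in their default model and default reduce_output, a feature definition can also be assembled programmatically. The sketch below collects the defaults listed in the encoder sections (with model names spelled as on the Hugging Face hub) into a table and builds the corresponding dictionary. `ENCODER_DEFAULTS` and `text_input_feature` are hypothetical names introduced for this example, not part of Ludwig, and requiring an explicit model for auto_transformer is an assumption based on the absence of a documented default.

```python
# Defaults taken from the encoder sections: (pretrained model, reduce_output).
ENCODER_DEFAULTS = {
    "bert": ("bert-base-uncased", "cls_pooled"),
    "gpt": ("openai-gpt", "sum"),
    "gpt2": ("gpt2", "sum"),
    "xlnet": ("xlnet-base-cased", "sum"),
    "xlm": ("xlm-mlm-en-2048", "sum"),
    "roberta": ("roberta-base", "cls_pooled"),
    "distilbert": ("distilbert-base-uncased", "sum"),
    "ctrl": ("ctrl", "sum"),
    "camembert": ("jplu/tf-camembert-base", "cls_pooled"),
    "albert": ("albert-base-v2", "cls_pooled"),
    "t5": ("t5-small", "sum"),
    "xlmroberta": ("jplu/tf-xlm-roberta-base", "cls_pooled"),
    "flaubert": ("jplu/tf-flaubert-base-uncased", "sum"),
    "electra": ("google/electra-small-discriminator", "sum"),
    "longformer": ("allenai/longformer-base-4096", "cls_pooled"),
}


def text_input_feature(name, encoder, level="word", **overrides):
    # Hypothetical helper: returns a feature dictionary with the documented defaults.
    if encoder == "auto_transformer":
        # No default model is documented for auto_transformer (assumption: it is required).
        if "pretrained_model_name_or_path" not in overrides:
            raise ValueError("auto_transformer needs an explicit pretrained_model_name_or_path")
        model, reduce_output = overrides["pretrained_model_name_or_path"], "sum"
    else:
        model, reduce_output = ENCODER_DEFAULTS[encoder]
    feature = {
        "name": name,
        "type": "text",
        "level": level,
        "encoder": encoder,
        "pretrained_model_name_or_path": model,
        "reduce_output": reduce_output,
        "trainable": False,
    }
    feature.update(overrides)  # explicit settings win over defaults
    return feature


print(text_input_feature("text_column_name", "bert"))
print(text_input_feature("text_column_name", "gpt2", reduce_output="mean", trainable=True))
```

The first call reproduces the example above except for tied_weights, which is left unset in this sketch.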
Text Output Features and Decoders¶
The decoders are the same as those used for Sequence Features.
The only difference is that you can specify an additional level parameter, with possible values word or char, to force the use of the words or the characters of the text (word is the default).
Example text output feature using default values:
name: sequence_column_name
type: text
level: word
decoder: generator
reduce_input: sum
dependencies: []
reduce_dependencies: sum
loss:
    type: softmax_cross_entropy
    confidence_penalty: 0
    robust_lambda: 0
    class_weights: 1
    class_similarities: null
    class_similarities_temperature: 0
    labels_smoothing: 0
    negative_samples: 0
    sampler: null
    distortion: 1
    unique: false
fc_layers: null
num_fc_layers: 0
fc_size: 256
use_bias: true
weights_initializer: glorot_uniform
bias_initializer: zeros
weights_regularizer: null
bias_regularizer: null
activity_regularizer: null
norm: null
norm_params: null
activation: relu
dropout: 0
cell_type: rnn
state_size: 256
embedding_size: 256
beam_width: 1
attention: null
tied_embeddings: null
max_sequence_length: 0
Text Features Measures¶
The measures are the same as those used for Sequence Features.