Text Features

Text Features Preprocessing

Text features are treated in the same way as sequence features, with a couple of differences. Two different tokenizations are performed: one splits at every character and one splits on whitespace and punctuation. Two corresponding keys are added to the HDF5 file, one containing the matrix of characters and one containing the matrix of words. The same happens in the JSON file, which contains dictionaries mapping characters to integers (and the inverse) and words to integers (and the inverse). In the configuration you can specify which level of representation to use: the character level or the word level.
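For example, a minimal text feature definition selecting the character-level representation (the column name is a placeholder):

```yaml
name: text_column_name
type: text
level: char
```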

The parameters available for preprocessing are:

  • char_tokenizer (default characters): defines how to map from the raw string content of the dataset column to a sequence of characters. The default value and only available option is characters and the behavior is to split the string at each character.
  • char_vocab_file (default null): path to a UTF-8 encoded file containing the character vocabulary to use. If null, the vocabulary is built from the data.
  • char_sequence_length_limit (default 1024): the maximum length of the text in characters. Texts that are longer than this value will be truncated, while texts that are shorter will be padded.
  • char_most_common (default 70): the maximum number of most common characters to be considered. If the data contains more than this amount, the least frequent characters will be treated as unknown.
  • word_tokenizer (default space_punct): defines how to map from the raw string content of the dataset column to a sequence of elements. For the available options refer to the Tokenizers section.
  • pretrained_model_name_or_path (default null): name of a pretrained Hugging Face model (or path where it was downloaded) whose tokenizer will be used to tokenize the text; relevant when using one of the Hugging Face encoders.
  • word_vocab_file (default null): path to a UTF-8 encoded file containing the word vocabulary to use. If null, the vocabulary is built from the data.
  • word_sequence_length_limit (default 256): the maximum length of the text in words. Texts that are longer than this value will be truncated, while texts that are shorter will be padded.
  • word_most_common (default 20000): the maximum number of most common words to be considered. If the data contains more than this amount, the least frequent words will be treated as unknown.
  • padding_symbol (default <PAD>): the string used as a padding symbol. It is mapped to the integer ID 0 in the vocabulary.
  • unknown_symbol (default <UNK>): the string used as an unknown symbol. It is mapped to the integer ID 1 in the vocabulary.
  • padding (default right): the direction of the padding. right and left are available options.
  • lowercase (default false): if true, the string is lowercased before being handled by the tokenizer.
  • missing_value_strategy (default fill_with_const): what strategy to follow when there's a missing value in a text column. The value should be one of fill_with_const (replaces the missing value with the value specified by the fill_value parameter), fill_with_mode (replaces missing values with the most frequent value in the column), fill_with_mean (replaces missing values with the mean of the values in the column), backfill (replaces missing values with the next valid value).
  • fill_value (default ""): the value to replace missing values with when missing_value_strategy is fill_with_const.

Example of text preprocessing.

name: text_column_name
type: text
level: word
preprocessing:
    char_tokenizer: characters
    char_vocab_file: null
    char_sequence_length_limit: 1024
    char_most_common: 70
    word_tokenizer: space_punct
    pretrained_model_name_or_path: null
    word_vocab_file: null
    word_sequence_length_limit: 256
    word_most_common: 20000
    padding_symbol: <PAD>
    unknown_symbol: <UNK>
    padding: right
    lowercase: false
    missing_value_strategy: fill_with_const
    fill_value: ""

Text Input Features and Encoders

Text input feature parameters are:

  • encoder (default parallel_cnn): encoder to use for the input text feature. The available encoders include those from Sequence Features and these text-specific encoders: bert, gpt, gpt2, xlnet, xlm, roberta, distilbert, ctrl, camembert, albert, t5, xlmroberta, flaubert, electra, longformer and auto_transformer.
  • level (default word): word specifies using text words, char specifies using individual characters.
  • tied_weights (default null): name of the input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.
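As a sketch of tied_weights usage (the feature names here are hypothetical), two text features with the same type and encoder parameters can share encoder weights:

```yaml
input_features:
    - name: description
      type: text
      level: word
      encoder: parallel_cnn
    - name: summary
      type: text
      level: word
      encoder: parallel_cnn
      tied_weights: description
```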

BERT Encoder

The bert encoder loads a pretrained BERT model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default bert-base-uncased): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

GPT Encoder

The gpt encoder loads a pretrained GPT model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default openai-gpt): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

GPT-2 Encoder

The gpt2 encoder loads a pretrained GPT-2 model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default gpt2): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

XLNet Encoder

The xlnet encoder loads a pretrained XLNet model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default xlnet-base-cased): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

XLM Encoder

The xlm encoder loads a pretrained XLM model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default xlm-mlm-en-2048): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

RoBERTa Encoder

The roberta encoder loads a pretrained RoBERTa model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default roberta-base): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

DistilBERT Encoder

The distilbert encoder loads a pretrained DistilBERT model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default distilbert-base-uncased): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

CTRL Encoder

The ctrl encoder loads a pretrained CTRL model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default ctrl): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

CamemBERT Encoder

The camembert encoder loads a pretrained CamemBERT model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default jplu/tf-camembert-base): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

ALBERT Encoder

The albert encoder loads a pretrained ALBERT model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default albert-base-v2): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

T5 Encoder

The t5 encoder loads a pretrained T5 model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default t5-small): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

XLM-RoBERTa Encoder

The xlmroberta encoder loads a pretrained XLM-RoBERTa model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default jplu/tf-xlm-roberta-base): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

FlauBERT Encoder

The flaubert encoder loads a pretrained FlauBERT model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default jplu/tf-flaubert-base-uncased): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

ELECTRA Encoder

The electra encoder loads a pretrained ELECTRA model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default google/electra-small-discriminator): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

Longformer Encoder

The longformer encoder loads a pretrained Longformer model using the Hugging Face transformers package.

  • pretrained_model_name_or_path (default allenai/longformer-base-4096): it can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
  • reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.

Auto-Transformer Encoder

The auto_transformer encoder loads a pretrained model using the Hugging Face transformers package. It is the best option for custom pretrained models that don't fit into any of the other pretrained transformer encoders.

  • pretrained_model_name_or_path (no default): it can be either the name of a model or a path where it was downloaded. For details on the available models refer to the Hugging Face documentation.
  • reduce_output (default sum): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
  • trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
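Unlike the other encoders, auto_transformer has no default model, so pretrained_model_name_or_path must always be provided. A sketch (the model name here is only an example of a valid Hugging Face model identifier):

```yaml
name: text_column_name
type: text
encoder: auto_transformer
pretrained_model_name_or_path: bert-base-multilingual-uncased
reduce_output: sum
trainable: false
```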

Example usage

Example text input feature encoder usage:

name: text_column_name
type: text
level: word
encoder: bert
tied_weights: null
pretrained_model_name_or_path: bert-base-uncased
reduce_output: cls_pooled
trainable: false

Text Output Features and Decoders

The decoders are the same as the ones used for Sequence Features. The only difference is that you can specify an additional level parameter with possible values word or char to force using text words or characters (by default word is used).

Example text output feature using default values:

name: sequence_column_name
type: text
level: word
decoder: generator
reduce_input: sum
dependencies: []
reduce_dependencies: sum
loss:
    type: softmax_cross_entropy
    confidence_penalty: 0
    robust_lambda: 0
    class_weights: 1
    class_similarities: null
    class_similarities_temperature: 0
    labels_smoothing: 0
    negative_samples: 0
    sampler: null
    distortion: 1
    unique: false
fc_layers: null
num_fc_layers: 0
fc_size: 256
use_bias: true
weights_initializer: glorot_uniform
bias_initializer: zeros
weights_regularizer: null
bias_regularizer: null
activity_regularizer: null
norm: null
norm_params: null
activation: relu
dropout: 0
cell_type: rnn
state_size: 256
embedding_size: 256
beam_width: 1
attention: null
tied_embeddings: null
max_sequence_length: 0

Text Features Measures

The measures are the same used for the Sequence Features.