Text Features
Preprocessing¶
Text features are an extension of sequence features. Text inputs are processed by a tokenizer which maps the raw text input into a sequence of tokens. An integer id is assigned to each unique token. Using this mapping, each text string is converted first to a sequence of tokens, and next to a sequence of integers.
The list of tokens and their integer representations (vocabulary) is stored in the metadata of the model. In the case of a text output feature, this same mapping is used to post-process predictions to text.
preprocessing:
    tokenizer: space_punct
    max_sequence_length: 256
    missing_value_strategy: fill_with_const
    most_common: 20000
    lowercase: false
    fill_value: <UNK>
    ngram_size: 2
    padding_symbol: <PAD>
    unknown_symbol: <UNK>
    padding: right
    cache_encoder_embeddings: false
    vocab_file: null
    sequence_length: null
    prompt:
        template: null
        task: null
        retrieval:
            type: null
            index_name: null
            model_name: null
            k: 0
Parameters:
- tokenizer (default: space_punct): Defines how to map from the raw string content of the dataset column to a sequence of elements. Options: space, space_punct, ngram, characters, underscore, comma, untokenized, stripped, english_tokenize, english_tokenize_filter, english_tokenize_remove_stopwords, english_lemmatize, english_lemmatize_filter, english_lemmatize_remove_stopwords, italian_tokenize, italian_tokenize_filter, italian_tokenize_remove_stopwords, italian_lemmatize, italian_lemmatize_filter, italian_lemmatize_remove_stopwords, spanish_tokenize, spanish_tokenize_filter, spanish_tokenize_remove_stopwords, spanish_lemmatize, spanish_lemmatize_filter, spanish_lemmatize_remove_stopwords, german_tokenize, german_tokenize_filter, german_tokenize_remove_stopwords, german_lemmatize, german_lemmatize_filter, german_lemmatize_remove_stopwords, french_tokenize, french_tokenize_filter, french_tokenize_remove_stopwords, french_lemmatize, french_lemmatize_filter, french_lemmatize_remove_stopwords, portuguese_tokenize, portuguese_tokenize_filter, portuguese_tokenize_remove_stopwords, portuguese_lemmatize, portuguese_lemmatize_filter, portuguese_lemmatize_remove_stopwords, dutch_tokenize, dutch_tokenize_filter, dutch_tokenize_remove_stopwords, dutch_lemmatize, dutch_lemmatize_filter, dutch_lemmatize_remove_stopwords, greek_tokenize, greek_tokenize_filter, greek_tokenize_remove_stopwords, greek_lemmatize, greek_lemmatize_filter, greek_lemmatize_remove_stopwords, norwegian_tokenize, norwegian_tokenize_filter, norwegian_tokenize_remove_stopwords, norwegian_lemmatize, norwegian_lemmatize_filter, norwegian_lemmatize_remove_stopwords, lithuanian_tokenize, lithuanian_tokenize_filter, lithuanian_tokenize_remove_stopwords, lithuanian_lemmatize, lithuanian_lemmatize_filter, lithuanian_lemmatize_remove_stopwords, danish_tokenize, danish_tokenize_filter, danish_tokenize_remove_stopwords, danish_lemmatize, danish_lemmatize_filter, danish_lemmatize_remove_stopwords, polish_tokenize, polish_tokenize_filter, polish_tokenize_remove_stopwords, polish_lemmatize, polish_lemmatize_filter, polish_lemmatize_remove_stopwords, romanian_tokenize, romanian_tokenize_filter, romanian_tokenize_remove_stopwords, romanian_lemmatize, romanian_lemmatize_filter, romanian_lemmatize_remove_stopwords, japanese_tokenize, japanese_tokenize_filter, japanese_tokenize_remove_stopwords, japanese_lemmatize, japanese_lemmatize_filter, japanese_lemmatize_remove_stopwords, chinese_tokenize, chinese_tokenize_filter, chinese_tokenize_remove_stopwords, chinese_lemmatize, chinese_lemmatize_filter, chinese_lemmatize_remove_stopwords, multi_tokenize, multi_tokenize_filter, multi_tokenize_remove_stopwords, multi_lemmatize, multi_lemmatize_filter, multi_lemmatize_remove_stopwords, sentencepiece, clip, gpt2bpe, bert, hf_tokenizer.
- max_sequence_length (default: 256): The maximum length (number of tokens) of the sequence. Sequences longer than this value will be truncated. Useful as a stopgap measure if sequence_length is set to None. If None, the max sequence length will be inferred from the training dataset.
- missing_value_strategy (default: fill_with_const): What strategy to follow when there's a missing value in a text column. Options: fill_with_const, fill_with_mode, bfill, ffill, drop_row. See Missing Value Strategy for details.
- most_common (default: 20000): The maximum number of most common tokens in the vocabulary. If the data contains more than this amount, the most infrequent symbols will be treated as unknown.
- lowercase (default: false): If true, converts the string to lowercase before tokenizing. Options: true, false.
- fill_value (default: <UNK>): The value to replace missing values with when missing_value_strategy is fill_with_const.
- ngram_size (default: 2): The size of the ngram when using the ngram tokenizer (e.g., 2 = bigram, 3 = trigram, etc.).
- padding_symbol (default: <PAD>): The string used as the padding symbol for sequence features. Ignored for features using huggingface encoders, which have their own vocabulary.
- unknown_symbol (default: <UNK>): The string used as the unknown symbol for sequence features. Ignored for features using huggingface encoders, which have their own vocabulary.
- padding (default: right): The direction of the padding. Options: left, right.
- cache_encoder_embeddings (default: false): For pretrained encoders, compute encoder embeddings in preprocessing, speeding up training time considerably. Only supported when encoder.trainable=false. Options: true, false.
- vocab_file (default: null): Filepath string to a UTF-8 encoded file containing the sequence's vocabulary. On each line, the first string until \t or \n is considered a word.
- sequence_length (default: null): The desired length (number of tokens) of the sequence. Sequences longer than this value will be truncated and sequences shorter than this value will be padded. If None, the sequence length will be inferred from the training dataset.
- prompt:
    - prompt.template (default: null): The template to use for the prompt. Must contain at least one of the columns from the input dataset or __sample__ as a variable surrounded in curly brackets {} to indicate where to insert the current feature. Multiple columns can be inserted, e.g.: The {color} {animal} jumped over the {size} {object}, where every term in curly brackets is a column in the dataset. If a task is specified, then the template must also contain the __task__ variable. If retrieval is specified, then the template must also contain the __context__ variable. If no template is provided, a default will be used based on the retrieval settings, and a task must be set in the config.
    - prompt.task (default: null): The task to use for the prompt. Required if template is not set.
    - prompt.retrieval (default: {"type": null})
Preprocessing parameters can also be defined once and applied to all text input features using the Type-Global Preprocessing section.
Note
If a text feature's encoder specifies a huggingface model, then the tokenizer for that model will be used automatically.
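For example, a minimal sketch of overriding a few of these preprocessing parameters for a single text input feature might look like the following (the column name review_text is hypothetical):
input_features:
    - name: review_text            # hypothetical column name
      type: text
      preprocessing:
          tokenizer: space         # split on whitespace only
          lowercase: true          # normalize case before building the vocabulary
          max_sequence_length: 128 # truncate longer inputs
          most_common: 10000       # cap the vocabulary size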
Input Features¶
The encoder parameters specified at the feature level are:
- tied (default: null): Name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.
Example text feature entry in the input features list:
name: text_column_name
type: text
tied: null
encoder:
    type: bert
    trainable: true
Parameters:
- type (default: parallel_cnn): Encoder to use for the input text feature. The available encoders include the encoders used for Sequence Features as well as pretrained text encoders from the Hugging Face transformers library: albert, auto_transformer, bert, camembert, ctrl, distilbert, electra, flaubert, gpt, gpt2, longformer, roberta, t5, mt5, transformer_xl, xlm, xlmroberta, xlnet.
Encoder type and encoder parameters can also be defined once and applied to all text input features using the Type-Global Encoder section.
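As an illustrative sketch of the tied parameter (both column names are hypothetical), two text features of the same type and with the same encoder configuration can share encoder weights:
input_features:
    - name: question          # hypothetical column name
      type: text
      encoder:
          type: rnn
    - name: answer            # hypothetical column name
      type: text
      tied: question          # reuse the weights of the "question" feature's encoder
      encoder:
          type: rnn           # must match the tied feature's encoder type and parameters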
Encoders¶
Embed Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["Aggregation\n Reduce\n Operation"];
C --> ...;
The embed encoder simply maps each token in the input sequence to an embedding, creating a b x s x h
tensor where b
is the batch size, s
is the length of the sequence and h
is the embedding size.
The tensor is reduced along the s
dimension to obtain a single vector of size h
for each element of the batch.
If you want to output the full b x s x h
tensor, you can specify reduce_output: null
.
encoder:
type: embed
dropout: 0.0
embedding_size: 256
representation: dense
weights_initializer: uniform
reduce_output: sum
embeddings_on_cpu: false
embeddings_trainable: true
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- weights_initializer (default: uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
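For example, a minimal sketch of an embed encoder that returns the full b x s x h tensor instead of a reduced vector (the feature name is hypothetical):
name: utterance             # hypothetical column name
type: text
encoder:
    type: embed
    embedding_size: 128
    dropout: 0.1
    reduce_output: null     # keep the full b x s x h tensor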
Parallel CNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
C --> D1["1D Conv\n Width 2"] --> E1["Pool"];
C --> D2["1D Conv\n Width 3"] --> E2["Pool"];
C --> D3["1D Conv\n Width 4"] --> E3["Pool"];
C --> D4["1D Conv\n Width 5"] --> E4["Pool"];
E1 --> F["Concat"] --> G["Fully\n Connected\n Layers"] --> H["..."];
E2 --> F;
E3 --> F;
E4 --> F;
The parallel cnn encoder is inspired by
Yoon Kim's Convolutional Neural Network for Sentence Classification.
It works by first mapping the input token sequence b x s
(where b
is the batch size and s
is the length of the
sequence) into a sequence of embeddings, then it passes the embedding through a number of parallel 1d convolutional
layers with different filter size (by default 4 layers with filter size 2, 3, 4 and 5), followed by max pooling and
concatenation.
This single vector concatenating the outputs of the parallel convolutional layers is then passed through a stack of
fully connected layers and returned as a b x h
tensor where h
is the output size of the last fully connected layer.
If you want to output the full b x s x h
tensor, you can specify reduce_output: null
.
encoder:
type: parallel_cnn
dropout: 0.0
embedding_size: 256
num_conv_layers: null
output_size: 256
activation: relu
filter_size: 3
norm: null
representation: dense
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: sum
conv_layers: null
pool_function: max
pool_size: null
norm_params: null
num_fc_layers: null
fc_layers: null
pretrained_embeddings: null
num_filters: 256
Parameters:
- dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- num_conv_layers (default: null): The number of stacked convolutional layers when conv_layers is null.
- output_size (default: 256): The default output_size that will be used for each layer.
- activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both conv_layers and num_conv_layers are null, a default list will be assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
- pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers, this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
- norm_params (default: null): Parameters used if norm is either batch or layer.
- num_fc_layers (default: null): Number of parallel fully connected layers to use.
- fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
- num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
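For instance, a sketch of a parallel_cnn encoder that overrides a few of the defaults above (the feature name is hypothetical):
name: title                 # hypothetical column name
type: text
encoder:
    type: parallel_cnn
    num_filters: 128        # output channels of each 1d convolution
    output_size: 128        # output size of the fully connected layers
    num_fc_layers: 1
    dropout: 0.1
    reduce_output: sum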
Stacked CNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["1D Conv Layers\n Different Widths"];
C --> D["Fully\n Connected\n Layers"];
D --> ...;
The stacked cnn encoder is inspired by Xiang Zhang et al.'s Character-level Convolutional Networks for Text Classification.
It works by first mapping the input token sequence b x s
(where b
is the batch size and s
is the length of the
sequence) into a sequence of embeddings, then it passes the embedding through a stack of 1d convolutional layers with
different filter size (by default 6 layers with filter size 7, 7, 3, 3, 3 and 3), followed by an optional final pool and
by a flatten operation.
This single flatten vector is then passed through a stack of fully connected layers and returned as a b x h
tensor
where h
is the output size of the last fully connected layer.
If you want to output the full b x s x h
tensor, you can specify the pool_size
of all your conv_layers
to be
null
and reduce_output: null
, while if pool_size
has a value different from null
and reduce_output: null
the
returned tensor will be of shape b x s' x h
, where s'
is the width of the output of the last convolutional layer.
encoder:
type: stacked_cnn
dropout: 0.0
num_conv_layers: null
embedding_size: 256
output_size: 256
activation: relu
filter_size: 3
strides: 1
norm: null
representation: dense
conv_layers: null
pool_function: max
pool_size: null
dilation_rate: 1
pool_strides: null
pool_padding: same
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: sum
norm_params: null
num_fc_layers: null
fc_layers: null
num_filters: 256
padding: same
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- num_conv_layers (default: null): The number of stacked convolutional layers when conv_layers is null.
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- output_size (default: 256): The default output_size that will be used for each layer.
- activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- strides (default: 1): Stride length of the convolution.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both conv_layers and num_conv_layers are null, a default list will be assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
- pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers, this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
- dilation_rate (default: 1): Dilation rate to use for dilated convolution.
- pool_strides (default: null): Factor to scale down.
- pool_padding (default: same): Padding to use. Options: valid, same.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- norm_params (default: null): Parameters used if norm is either batch or layer.
- num_fc_layers (default: null): Number of parallel fully connected layers to use.
- fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.
- num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
- padding (default: same): Padding to use. Options: valid, same.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
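As a sketch of the behavior described above (the feature name is hypothetical), setting pool_size to null in every conv layer together with reduce_output: null keeps the full b x s x h tensor:
name: document              # hypothetical column name
type: text
encoder:
    type: stacked_cnn
    conv_layers:
        - filter_size: 7
          pool_size: null   # no pooling, so the sequence dimension is preserved
        - filter_size: 3
          pool_size: null
    reduce_output: null     # no final reduction, output stays b x s x h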
Stacked Parallel CNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
C --> D1["1D Conv\n Width 2"] --> E["Concat"];
C --> D2["1D Conv\n Width 3"] --> E;
C --> D3["1D Conv\n Width 4"] --> E;
C --> D4["1D Conv\n Width 5"] --> E;
E --> F["..."];
F --> G1["1D Conv\n Width 2"] --> H["Concat"];
F --> G2["1D Conv\n Width 3"] --> H;
F --> G3["1D Conv\n Width 4"] --> H;
F --> G4["1D Conv\n Width 5"] --> H;
H --> I["Pool"] --> J["Fully\n Connected\n Layers"] --> K["..."];
The stacked parallel cnn encoder is a combination of the Parallel CNN and the Stacked CNN encoders where each layer of
the stack is composed of parallel convolutional layers.
It works by first mapping the input token sequence b x s
(where b
is the batch size and s
is the length of the
sequence) into a sequence of embeddings, then it passes the embedding through a stack of several parallel 1d
convolutional layers with different filter size, followed by an optional final pool and by a flatten operation.
This single flattened vector is then passed through a stack of fully connected layers and returned as a b x h
tensor
where h
is the output size of the last fully connected layer.
If you want to output the full b x s x h
tensor, you can specify reduce_output: null
.
encoder:
type: stacked_parallel_cnn
dropout: 0.0
embedding_size: 256
output_size: 256
activation: relu
filter_size: 3
norm: null
representation: dense
num_stacked_layers: null
pool_function: max
pool_size: null
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: sum
norm_params: null
num_fc_layers: null
fc_layers: null
stacked_layers: null
num_filters: 256
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- output_size (default: 256): The default output_size that will be used for each layer.
- activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- num_stacked_layers (default: null): If stacked_layers is null, this is the number of elements in the stack of parallel convolutional layers.
- pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers, this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- norm_params (default: null): Parameters used if norm is either batch or layer.
- num_fc_layers (default: null): Number of parallel fully connected layers to use.
- fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.
- stacked_layers (default: null): A nested list of lists of dictionaries containing the parameters of the stack of parallel convolutional layers. The length of the list determines the number of stacked parallel convolutional layers, the length of the sub-lists determines the number of parallel conv layers, and the content of each dictionary determines the parameters for a specific layer.
- num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
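An illustrative sketch of the nested stacked_layers structure (the feature name is hypothetical): the outer list enumerates the stack, and each inner list enumerates the parallel convolutional layers of that stack element:
name: abstract              # hypothetical column name
type: text
encoder:
    type: stacked_parallel_cnn
    stacked_layers:
        - - filter_size: 2  # first stack element: three parallel conv layers
          - filter_size: 3
          - filter_size: 4
        - - filter_size: 3  # second stack element: two parallel conv layers
          - filter_size: 5
    num_filters: 64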
RNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["RNN Layers"];
C --> D["Fully\n Connected\n Layers"];
D --> ...;
The rnn encoder works by first mapping the input token sequence b x s
(where b
is the batch size and s
is the
length of the sequence) into a sequence of embeddings, then it passes the embedding through a stack of recurrent layers
(by default 1 layer), followed by a reduce operation that by default only returns the last output, but can perform other
reduce functions.
If you want to output the full b x s x h tensor, where h is the size of the output of the last rnn layer, you can specify reduce_output: null.
encoder:
type: rnn
dropout: 0.0
cell_type: rnn
num_layers: 1
state_size: 256
embedding_size: 256
output_size: 256
norm: null
num_fc_layers: 0
fc_dropout: 0.0
recurrent_dropout: 0.0
activation: tanh
fc_activation: relu
recurrent_activation: sigmoid
representation: dense
unit_forget_bias: true
recurrent_initializer: orthogonal
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: last
norm_params: null
fc_layers: null
bidirectional: false
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- cell_type (default: rnn): The type of recurrent cell to use. Available values are: rnn, lstm, gru. For reference about the differences between the cells please refer to torch.nn Recurrent Layers. Options: rnn, lstm, gru.
- num_layers (default: 1): The number of stacked recurrent layers.
- state_size (default: 256): The size of the state of the rnn.
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- output_size (default: 256): The default output_size that will be used for each layer.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, ghost, null.
- num_fc_layers (default: 0): Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- recurrent_dropout (default: 0.0): The dropout rate for the recurrent state.
- activation (default: tanh): The default activation function. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- recurrent_activation (default: sigmoid): The activation function to use in the recurrent step. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- unit_forget_bias (default: true): If true, add 1 to the bias of the forget gate at initialization. Options: true, false.
- recurrent_initializer (default: orthogonal): The initializer for recurrent matrix weights. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: last): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- norm_params (default: null): Default parameters passed to the norm module.
- fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- bidirectional (default: false): If true, two recurrent networks will perform encoding in the forward and backward direction and their outputs will be concatenated. Options: true, false.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
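For example, a sketch of a bidirectional LSTM configuration (the feature name is hypothetical):
name: sentence              # hypothetical column name
type: text
encoder:
    type: rnn
    cell_type: lstm
    num_layers: 2
    state_size: 128
    bidirectional: true     # forward and backward outputs are concatenated
    reduce_output: last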
CNN RNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C1["CNN Layers"];
C1 --> C2["RNN Layers"];
C2 --> D["Fully\n Connected\n Layers"];
D --> ...;
The cnnrnn
encoder works by first mapping the input token sequence b x s
(where b
is the batch size and s
is
the length of the sequence) into a sequence of embeddings, then it passes the embedding through a stack of convolutional
layers (by default 2), that is followed by a stack of recurrent layers (by default 1), followed by a reduce operation
that by default only returns the last output, but can perform other reduce functions.
If you want to output the full b x s x h tensor, where h is the size of the output of the last rnn layer, you can specify reduce_output: null.
encoder:
type: cnnrnn
dropout: 0.0
conv_dropout: 0.0
cell_type: rnn
num_conv_layers: null
state_size: 256
embedding_size: 256
output_size: 256
norm: null
num_fc_layers: 0
fc_dropout: 0.0
recurrent_dropout: 0.0
activation: tanh
filter_size: 5
strides: 1
fc_activation: relu
recurrent_activation: sigmoid
conv_activation: relu
representation: dense
conv_layers: null
pool_function: max
pool_size: null
dilation_rate: 1
pool_strides: null
pool_padding: same
unit_forget_bias: true
recurrent_initializer: orthogonal
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: last
norm_params: null
fc_layers: null
num_filters: 256
padding: same
num_rec_layers: 1
bidirectional: false
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- conv_dropout (default: 0.0): The dropout rate for the convolutional layers.
- cell_type (default: rnn): The type of recurrent cell to use. Available values are: rnn, lstm, gru. For reference about the differences between the cells please refer to torch.nn Recurrent Layers. Options: rnn, lstm, gru.
- num_conv_layers (default: null): The number of stacked convolutional layers when conv_layers is null.
- state_size (default: 256): The size of the state of the rnn.
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- output_size (default: 256): The default output_size that will be used for each layer.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, ghost, null.
- num_fc_layers (default: 0): Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- recurrent_dropout (default: 0.0): The dropout rate for the recurrent state.
- activation (default: tanh): The default activation function to use. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- filter_size (default: 5): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- strides (default: 1): Stride length of the convolution.
- fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- recurrent_activation (default: sigmoid): The activation function to use in the recurrent step. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- conv_activation (default: relu): The default activation function that will be used for each convolutional layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both conv_layers and num_conv_layers are null, a default list will be assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
- pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers, this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
- dilation_rate (default: 1): Dilation rate to use for dilated convolution.
- pool_strides (default: null): Factor to scale down.
- pool_padding (default: same): Padding to use. Options: valid, same.
- unit_forget_bias (default: true): If true, add 1 to the bias of the forget gate at initialization. Options: true, false.
- recurrent_initializer (default: orthogonal): The initializer for recurrent matrix weights. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: last): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- norm_params (default: null): Default parameters passed to the norm module.
- fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
- padding (default: same): Padding to use. Options: valid, same.
- num_rec_layers (default: 1): The number of stacked recurrent layers.
- bidirectional (default: false): If true, two recurrent networks will perform encoding in the forward and backward direction and their outputs will be concatenated. Options: true, false.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
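For example, a sketch of a cnnrnn encoder combining a small convolutional stack with a GRU layer (the feature name is hypothetical):
name: transcript            # hypothetical column name
type: text
encoder:
    type: cnnrnn
    num_conv_layers: 2      # convolutional stack applied before the recurrent layers
    cell_type: gru
    num_rec_layers: 1
    state_size: 128
    reduce_output: last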
Transformer Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["Transformer\n Blocks"];
C --> D["Fully\n Connected\n Layers"];
D --> ...;
The transformer
encoder implements a stack of transformer blocks, replicating the architecture introduced in the
Attention is all you need paper, and adds an optional stack of fully connected
layers at the end.
encoder:
type: transformer
dropout: 0.1
num_layers: 1
embedding_size: 256
output_size: 256
norm: null
num_fc_layers: 0
fc_dropout: 0.0
hidden_size: 256
transformer_output_size: 256
fc_activation: relu
representation: dense
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: last
norm_params: null
fc_layers: null
num_heads: 8
pretrained_embeddings: null
Parameters:
dropout
(default:0.1
) : The dropout rate for the transformer block. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).num_layers
(default:1
) : The number of transformer layers.embedding_size
(default:256
) : The maximum embedding size. The actual size will bemin(vocabulary_size, embedding_size)
fordense
representations and exactlyvocabulary_size
for thesparse
encoding, wherevocabulary_size
is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>
,<PAD>
,<SOS>
,<EOS>
).output_size
(default:256
) : The default output_size that will be used for each layer.norm
(default:null
) : The default norm that will be used for each layer. Options:batch
,layer
,ghost
,null
.num_fc_layers
(default:0
) : Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.fc_dropout
(default:0.0
) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).hidden_size
(default:256
): The size of the hidden representation within the transformer block. It is usually the same as the embedding_size, but if the two values are different, a projection layer will be added before the first transformer block.transformer_output_size
(default:256
): Size of the fully connected layer after self attention in the transformer block. This is usually the same as hidden_size and embedding_size.fc_activation
(default:relu
): Default activation function applied to the output of the fully connected layers. Options:elu
,leakyRelu
,logSigmoid
,relu
,sigmoid
,tanh
,softmax
,null
.representation
(default:dense
): Representation of the embedding.dense
means the embeddings are initialized randomly,sparse
means they are initialized to be one-hot encodings. Options:dense
,sparse
.use_bias
(default:true
): Whether to use a bias vector. Options:true
,false
.-
bias_initializer
(default:zeros
): Initializer for the bias vector. Options:uniform
,normal
,constant
,ones
,zeros
,eye
,dirac
,xavier_uniform
,xavier_normal
,kaiming_uniform
,kaiming_normal
,orthogonal
,sparse
,identity
. Alternatively it is possible to specify a dictionary with a keytype
that identifies the type of initializer and other keys for its parameters, e.g.{type: normal, mean: 0, stddev: 0}
. For a description of the parameters of each initializer, see torch.nn.init. -
weights_initializer
(default:xavier_uniform
): Initializer for the weight matrix. Options:uniform
,normal
,constant
,ones
,zeros
,eye
,dirac
,xavier_uniform
,xavier_normal
,kaiming_uniform
,kaiming_normal
,orthogonal
,sparse
,identity
. Alternatively it is possible to specify a dictionary with a keytype
that identifies the type of initializer and other keys for its parameters, e.g.{type: normal, mean: 0, stddev: 0}
. For a description of the parameters of each initializer, see torch.nn.init. -
embeddings_on_cpu
(default:false
): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options:true
,false
. embeddings_trainable
(default:true
): Iftrue
embeddings are trained during the training process, iffalse
embeddings are fixed. This can be useful when loading pretrained embeddings, to avoid fine-tuning them. This parameter has effect only whenrepresentation
isdense
;sparse
one-hot encodings are not trainable. Options:true
,false
.reduce_output
(default:last
): How to reduce the output tensor along thes
sequence length dimension if the rank of the tensor is greater than 2. Options:last
,sum
,mean
,avg
,max
,concat
,attention
,none
,None
,null
.norm_params
(default:null
): Default parameters passed to thenorm
module.-
fc_layers
(default:null
): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are:activation
,dropout
,norm
,norm_params
,output_size
,use_bias
,bias_initializer
andweights_initializer
. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead (see the sketch after this parameter list). -
num_heads
(default:8
): Number of attention heads in each transformer block. -
pretrained_embeddings
(default:null
): Path to a file containing pretrained embeddings. By defaultdense
embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the embeddings file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only ifrepresentation
isdense
.
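As referenced in the fc_layers parameter above, here is a minimal sketch of a per-layer override for this encoder; the layer sizes and dropout value are illustrative assumptions, not recommended settings:
encoder:
    type: transformer
    fc_layers:
        - output_size: 512
          activation: relu
        - output_size: 256
          dropout: 0.2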
Huggingface encoders¶
All huggingface-based text encoders are configured with the following parameters:
pretrained_model_name_or_path
(default is the huggingface default model path for the specified encoder, i.e.bert-base-uncased
for BERT). This can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.reduce_output
(defaultcls_pooled
): defines how to reduce the output tensor along thes
sequence length dimension if the rank of the tensor is greater than 2. Available values are:cls_pooled
,sum
,mean
oravg
,max
,concat
(concatenates along the first dimension),last
(returns the last vector of the first dimension) andnull
(which does not reduce and returns the full tensor).trainable
(defaultfalse
): iftrue
the weights of the encoder will be trained, otherwise they will be kept frozen.
Note
Any hyperparameter of any huggingface encoder can be overridden. Check the huggingface documentation for which parameters are used for which models.
name: text_column_name
type: text
encoder:
    type: bert
    trainable: true
    num_attention_heads: 16 # Instead of 12
AutoTransformer¶
The auto_transformer
encoder automatically instantiates the model architecture for the specified pretrained_model_name_or_path
. Unlike the other HF encoders, auto_transformer
does not provide a default value for pretrained_model_name_or_path
, this is its only mandatory parameter. See the Hugging Face AutoModels documentation for more details.
encoder:
type: auto_transformer
pretrained_model_name_or_path: bert-base-uncased
trainable: false
reduce_output: sum
pretrained_kwargs: null
adapter: null
Parameters:
pretrained_model_name_or_path
(default:null
) : Name or path of the pretrained model.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor. Options:last
,sum
,mean
,avg
,max
,concat
,attention
,none
,None
,null
.-
pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning
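Because pretrained_model_name_or_path has no default, it must always be set explicitly for auto_transformer. A minimal sketch, using bert-base-uncased purely as an illustrative choice of Hugging Face model:
encoder:
    type: auto_transformer
    pretrained_model_name_or_path: bert-base-uncased
    trainable: false
    reduce_output: sum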
ALBERT¶
The albert
encoder loads a pretrained ALBERT (default albert-base-v2
) model using the Hugging Face transformers package. ALBERT is similar to BERT, with significantly lower memory usage and somewhat faster training time.
encoder:
type: albert
use_pretrained: true
trainable: false
pretrained_model_name_or_path: albert-base-v2
reduce_output: cls_pooled
embedding_size: 128
hidden_size: 768
num_hidden_layers: 12
num_hidden_groups: 1
num_attention_heads: 12
intermediate_size: 3072
inner_group_num: 1
hidden_act: gelu_new
hidden_dropout_prob: 0.0
attention_probs_dropout_prob: 0.0
max_position_embeddings: 512
type_vocab_size: 2
initializer_range: 0.02
layer_norm_eps: 1.0e-12
classifier_dropout_prob: 0.1
position_embedding_type: absolute
pad_token_id: 0
bos_token_id: 2
eos_token_id: 3
pretrained_kwargs: null
adapter: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:albert-base-v2
): Name or path of the pretrained model.reduce_output
(default:cls_pooled
): The method used to reduce a sequence of tensors down to a single tensor.embedding_size
(default:128
): Dimensionality of vocabulary embeddings.hidden_size
(default:768
): Dimensionality of the encoder layers and the pooler layer.num_hidden_layers
(default:12
): Number of hidden layers in the Transformer encoder.num_hidden_groups
(default:1
): Number of groups for the hidden layers, parameters in the same group are shared.num_attention_heads
(default:12
): Number of attention heads for each attention layer in the Transformer encoder.intermediate_size
(default:3072
): The dimensionality of the βintermediateβ (often named feed-forward) layer in the Transformer encoder.inner_group_num
(default:1
): The number of inner repetition of attention and ffn.hidden_act
(default:gelu_new
): The non-linear activation function (function or string) in the encoder and pooler. Options:gelu
,relu
,silu
,gelu_new
.hidden_dropout_prob
(default:0.0
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.attention_probs_dropout_prob
(default:0.0
): The dropout ratio for the attention probabilities.max_position_embeddings
(default:512
): The maximum sequence length that this model might ever be used with. Typically set this to something large (e.g., 512 or 1024 or 2048).type_vocab_size
(default:2
): The vocabulary size of the token_type_ids passed when calling AlbertModel or TFAlbertModel.initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.layer_norm_eps
(default:1e-12
): The epsilon used by the layer normalization layers.classifier_dropout_prob
(default:0.1
): The dropout ratio for attached classifiers.position_embedding_type
(default:absolute
): Options:absolute
,relative_key
,relative_key_query
.pad_token_id
(default:0
): The ID of the token to use as padding.bos_token_id
(default:2
): The beginning of sequence token ID.eos_token_id
(default:3
): The end of sequence token ID.-
pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning
BERT¶
The bert encoder loads a pretrained BERT (default bert-base-uncased) model using the Hugging Face transformers package. BERT is a bidirectional transformer pretrained using a combination of masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.
encoder:
type: bert
use_pretrained: true
trainable: false
pretrained_model_name_or_path: bert-base-uncased
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
max_position_embeddings: 512
classifier_dropout: null
reduce_output: cls_pooled
hidden_size: 768
num_hidden_layers: 12
num_attention_heads: 12
intermediate_size: 3072
hidden_act: gelu
type_vocab_size: 2
initializer_range: 0.02
layer_norm_eps: 1.0e-12
pad_token_id: 0
gradient_checkpointing: false
position_embedding_type: absolute
pretrained_kwargs: null
adapter: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:bert-base-uncased
): Name or path of the pretrained model.hidden_dropout_prob
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.attention_probs_dropout_prob
(default:0.1
): The dropout ratio for the attention probabilities.max_position_embeddings
(default:512
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).classifier_dropout
(default:null
): The dropout ratio for the classification head.reduce_output
(default:cls_pooled
): The method used to reduce a sequence of tensors down to a single tensor.hidden_size
(default:768
): Dimensionality of the encoder layers and the pooler layer.num_hidden_layers
(default:12
): Number of hidden layers in the Transformer encoder.num_attention_heads
(default:12
): Number of attention heads for each attention layer in the Transformer encoder.intermediate_size
(default:3072
): Dimensionality of the βintermediateβ (often named feed-forward) layer in the Transformer encoder.hidden_act
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. Options:gelu
,relu
,silu
,gelu_new
.type_vocab_size
(default:2
): The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel.initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.layer_norm_eps
(default:1e-12
): The epsilon used by the layer normalization layers.pad_token_id
(default:0
): The ID of the token to use as padding.gradient_checkpointing
(default:false
): Whether to use gradient checkpointing. Options:true
,false
.position_embedding_type
(default:absolute
): Type of position embedding. Options:absolute
,relative_key
,relative_key_query
.-
pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning
CamemBERT¶
The camembert
encoder loads a pretrained CamemBERT (default camembert-base
) model using the Hugging Face transformers package. CamemBERT is pre-trained on a large French language web-crawled text corpus.
encoder:
type: camembert
use_pretrained: true
trainable: false
pretrained_model_name_or_path: camembert-base
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
max_position_embeddings: 514
classifier_dropout: null
reduce_output: sum
hidden_size: 768
hidden_act: gelu
initializer_range: 0.02
adapter: null
num_hidden_layers: 12
num_attention_heads: 12
intermediate_size: 3072
type_vocab_size: 1
layer_norm_eps: 1.0e-05
pad_token_id: 1
gradient_checkpointing: false
position_embedding_type: absolute
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:camembert-base
): Name or path of the pretrained model.hidden_dropout_prob
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.attention_probs_dropout_prob
(default:0.1
): The dropout ratio for the attention probabilities.max_position_embeddings
(default:514
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).classifier_dropout
(default:null
): The dropout ratio for the classification head.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.hidden_size
(default:768
): Dimensionality of the encoder layers and the pooler layer.hidden_act
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. Options:gelu
,relu
,silu
,gelu_new
.-
initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
num_hidden_layers
(default:12
): Number of hidden layers in the Transformer encoder. num_attention_heads
(default:12
): Number of attention heads for each attention layer in the Transformer encoder.intermediate_size
(default:3072
): Dimensionality of the βintermediateβ (often named feed-forward) layer in the Transformer encoder.type_vocab_size
(default:1
): The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel.layer_norm_eps
(default:1e-05
): The epsilon used by the layer normalization layers.pad_token_id
(default:1
): The ID of the token to use as padding.gradient_checkpointing
(default:false
): Whether to use gradient checkpointing. Options:true
,false
.position_embedding_type
(default:absolute
): Type of position embedding. Options:absolute
,relative_key
,relative_key_query
.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
DeBERTa¶
The DeBERTa encoder improves on the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With those two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB of training data. In DeBERTa V3, the authors further improved the efficiency of DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. Compared to DeBERTa, the V3 version significantly improves model performance on downstream tasks.
encoder:
type: deberta
use_pretrained: true
trainable: false
pretrained_model_name_or_path: sileod/deberta-v3-base-tasksource-nli
hidden_size: 1536
num_hidden_layers: 24
num_attention_heads: 24
intermediate_size: 6144
hidden_act: gelu
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
max_position_embeddings: 512
type_vocab_size: 0
initializer_range: 0.02
layer_norm_eps: 1.0e-12
relative_attention: true
max_relative_positions: -1
pad_token_id: 0
position_biased_input: false
pos_att_type:
- p2c
- c2p
pooler_hidden_size: 1536
pooler_dropout: 0
pooler_hidden_act: gelu
position_buckets: 256
share_att_key: true
norm_rel_ebd: layer_norm
adapter: null
pretrained_kwargs: null
reduce_output: sum
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.-
pretrained_model_name_or_path
(default:sileod/deberta-v3-base-tasksource-nli
): Name or path of the pretrained model. -
hidden_size
(default:1536
): Dimensionality of the encoder layers and the pooler layer. num_hidden_layers
(default:24
): Number of hidden layers in the Transformer encoder.num_attention_heads
(default:24
): Number of attention heads for each attention layer in the Transformer encoder.intermediate_size
(default:6144
): Dimensionality of the 'intermediate' (often named feed-forward) layer in the Transformer encoder.hidden_act
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. Options:gelu
,relu
,silu
,tanh
,gelu_fast
,mish
,linear
,sigmoid
,gelu_new
.hidden_dropout_prob
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.attention_probs_dropout_prob
(default:0.1
): The dropout ratio for the attention probabilities.max_position_embeddings
(default:512
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).type_vocab_size
(default:0
): The vocabulary size of thetoken_type_ids
.initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.layer_norm_eps
(default:1e-12
): The epsilon used by the layer normalization layers.relative_attention
(default:true
): Whether use relative position encoding. Options:true
,false
.max_relative_positions
(default:-1
): The range of relative positions[-max_position_embeddings, max_position_embeddings]
. Use the same value asmax_position_embeddings
.pad_token_id
(default:0
): The value used to pad input_ids.position_biased_input
(default:false
): Whether add absolute position embedding to content embedding. Options:true
,false
.pos_att_type
(default:["p2c", "c2p"]
): The type of relative position attention; it can be a combination of['p2c', 'c2p']
, e.g.['p2c']
or['p2c', 'c2p']
.pooler_hidden_size
(default:1536
): The hidden size of the pooler layers.pooler_dropout
(default:0
): The dropout ratio for the pooler layers.pooler_hidden_act
(default:gelu
): The activation function (function or string) in the pooler. Options:gelu
,relu
,silu
,tanh
,gelu_fast
,mish
,linear
,sigmoid
,gelu_new
.position_buckets
(default:256
): The number of buckets to use for each attention layer.share_att_key
(default:true
): Whether to share attention key across layers. Options:true
,false
.-
norm_rel_ebd
(default:layer_norm
): The normalization method for relative embeddings. Options:layer_norm
,none
. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model. -
reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor. Options:cls_pooled
,last
,sum
,mean
,max
,concat
,attention
,null
.
DistilBERT¶
The distilbert
encoder loads a pretrained DistilBERT (default distilbert-base-uncased
) model using the Hugging Face transformers package. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
encoder:
type: distilbert
use_pretrained: true
trainable: false
pretrained_model_name_or_path: distilbert-base-uncased
dropout: 0.1
max_position_embeddings: 512
attention_dropout: 0.1
activation: gelu
reduce_output: sum
initializer_range: 0.02
qa_dropout: 0.1
seq_classif_dropout: 0.2
adapter: null
sinusoidal_pos_embds: false
n_layers: 6
n_heads: 12
dim: 768
hidden_dim: 3072
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:distilbert-base-uncased
): Name or path of the pretrained model.dropout
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.max_position_embeddings
(default:512
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).attention_dropout
(default:0.1
): The dropout ratio for the attention probabilities.activation
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. If string, 'gelu', 'relu', 'silu' and 'gelu_new' are supported. Options:gelu
,relu
,silu
,gelu_new
.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.qa_dropout
(default:0.1
): The dropout probabilities used in the question answering model DistilBertForQuestionAnswering.-
seq_classif_dropout
(default:0.2
): The dropout probabilities used in the sequence classification and the multiple choice model DistilBertForSequenceClassification. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
sinusoidal_pos_embds
(default:false
): Whether to use sinusoidal positional embeddings. Options:true
,false
. n_layers
(default:6
): Number of hidden layers in the Transformer encoder.n_heads
(default:12
): Number of hidden layers in the Transformer encoder.dim
(default:768
): Dimensionality of the encoder layers and the pooler layer.hidden_dim
(default:3072
): The size of the βintermediateβ (often named feed-forward) layer in the Transformer encoder.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
ELECTRA¶
The electra encoder loads a pretrained ELECTRA model using the Hugging Face transformers package. ELECTRA is a pretraining approach which trains two transformer models: the generator and the discriminator. The generator's role is to replace tokens in a sequence, and it is therefore trained as a masked language model. The discriminator, which is the model we're interested in, tries to identify which tokens were replaced by the generator in the sequence.
encoder:
type: electra
use_pretrained: true
trainable: false
pretrained_model_name_or_path: google/electra-small-discriminator
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
max_position_embeddings: 512
classifier_dropout: null
reduce_output: sum
embedding_size: 128
hidden_size: 256
hidden_act: gelu
initializer_range: 0.02
adapter: null
num_hidden_layers: 12
num_attention_heads: 4
intermediate_size: 1024
type_vocab_size: 2
layer_norm_eps: 1.0e-12
position_embedding_type: absolute
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:google/electra-small-discriminator
): Name or path of the pretrained model.hidden_dropout_prob
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.attention_probs_dropout_prob
(default:0.1
): The dropout ratio for the attention probabilities.max_position_embeddings
(default:512
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).classifier_dropout
(default:null
): The dropout ratio for the classification head.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.embedding_size
(default:128
): Dimensionality of the encoder layers and the pooler layer.hidden_size
(default:256
): Dimensionality of the encoder layers and the pooler layer.hidden_act
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. Options:gelu
,relu
,silu
,gelu_new
.-
initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
num_hidden_layers
(default:12
): Number of hidden layers in the Transformer encoder. num_attention_heads
(default:4
): Number of attention heads for each attention layer in the Transformer encoder.intermediate_size
(default:1024
): Dimensionality of the βintermediateβ (i.e., feed-forward) layer in the Transformer encoder.type_vocab_size
(default:2
): The vocabulary size of the token_type_ids passed when calling ElectraModel or TFElectraModel.layer_norm_eps
(default:1e-12
): The epsilon used by the layer normalization layers.position_embedding_type
(default:absolute
): Type of position embedding. Options:absolute
,relative_key
,relative_key_query
.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
FlauBERT¶
The flaubert encoder loads a pretrained FlauBERT (default flaubert/flaubert_small_cased) model using the Hugging Face transformers package. FlauBERT (https://arxiv.org/abs/1912.05372) has an architecture similar to BERT and is pre-trained on a large French language corpus.
encoder:
type: flaubert
use_pretrained: true
trainable: false
pretrained_model_name_or_path: flaubert/flaubert_small_cased
dropout: 0.1
reduce_output: sum
pre_norm: true
layerdrop: 0.2
emb_dim: 512
n_layers: 6
n_heads: 8
attention_dropout: 0.1
gelu_activation: true
sinusoidal_embeddings: false
causal: false
asm: false
n_langs: 1
use_lang_emb: true
max_position_embeddings: 512
embed_init_std: 0.02209708691207961
init_std: 0.02
layer_norm_eps: 1.0e-06
bos_index: 0
eos_index: 1
pad_index: 2
unk_index: 3
mask_index: 5
is_encoder: true
mask_token_id: 0
lang_id: 0
pretrained_kwargs: null
adapter: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:flaubert/flaubert_small_cased
): Name or path of the pretrained model.dropout
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.pre_norm
(default:true
): Whether to apply the layer normalization before or after the feed forward layer following the attention in each layer (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018) Options:true
,false
.layerdrop
(default:0.2
): Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand with Structured Dropout. ICLR 2020)emb_dim
(default:512
): Dimensionality of the encoder layers and the pooler layer.n_layers
(default:6
): Number of hidden layers in the Transformer encoder.n_heads
(default:8
): Number of attention heads for each attention layer in the Transformer encoder.attention_dropout
(default:0.1
): The dropout probability for the attention mechanismgelu_activation
(default:true
): Whether or not to use a gelu activation instead of relu. Options:true
,false
.sinusoidal_embeddings
(default:false
): Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings. Options:true
,false
.causal
(default:false
): Whether or not the model should behave in a causal manner. Causal models use a triangular attention mask in order to only attend to the left-side context instead of a bidirectional context. Options:true
,false
.asm
(default:false
): Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction layer. Options:true
,false
.n_langs
(default:1
): The number of languages the model handles. Set to 1 for monolingual models.use_lang_emb
(default:true
): Whether to use language embeddings. Some models use additional language embeddings, see the multilingual models page for information on how to use them. Options:true
,false
.max_position_embeddings
(default:512
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).embed_init_std
(default:0.02209708691207961
): The standard deviation of the truncated_normal_initializer for initializing the embedding matrices.init_std
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices except the embedding matrices.layer_norm_eps
(default:1e-06
): The epsilon used by the layer normalization layers.bos_index
(default:0
): The index of the beginning of sentence token in the vocabulary.eos_index
(default:1
): The index of the end of sentence token in the vocabulary.pad_index
(default:2
): The index of the padding token in the vocabulary.unk_index
(default:3
): The index of the unknown token in the vocabulary.mask_index
(default:5
): The index of the masking token in the vocabulary.is_encoder
(default:true
): Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al. Options:true
,false
.mask_token_id
(default:0
): Model agnostic parameter to identify masked tokens when generating text in an MLM context.lang_id
(default:0
): The ID of the language used by the model. This parameter is used when generating text in a given language.-
pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning
GPT¶
The gpt
encoder loads a pretrained GPT (default openai-gpt
) model using the Hugging Face transformers package. GPT is a causal (unidirectional) transformer pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.
encoder:
type: gpt
use_pretrained: true
trainable: false
pretrained_model_name_or_path: openai-gpt
reduce_output: sum
initializer_range: 0.02
adapter: null
n_positions: 40478
n_ctx: 512
n_embd: 768
n_layer: 12
n_head: 12
afn: gelu
resid_pdrop: 0.1
embd_pdrop: 0.1
attn_pdrop: 0.1
layer_norm_epsilon: 1.0e-05
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:openai-gpt
): Name or path of the pretrained model.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.-
initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
n_positions
(default:40478
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). n_ctx
(default:512
): Dimensionality of the causal mask (usually same as n_positions)n_embd
(default:768
): Dimensionality of the embeddings and hidden states.n_layer
(default:12
): Number of hidden layers in the Transformer encoder.n_head
(default:12
): Number of attention heads for each attention layer in the Transformer encoder.afn
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. Options:gelu
,relu
,silu
.resid_pdrop
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.embd_pdrop
(default:0.1
): The dropout ratio for the embeddings.attn_pdrop
(default:0.1
): The dropout ratio for the attention.layer_norm_epsilon
(default:1e-05
): The epsilon to use in the layer normalization layerspretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
GPT2¶
The gpt2
encoder loads a pretrained GPT-2 (default gpt2
) model using the Hugging Face transformers package. GPT-2 is a causal (unidirectional) transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.
encoder:
type: gpt2
use_pretrained: true
trainable: false
pretrained_model_name_or_path: gpt2
reduce_output: sum
initializer_range: 0.02
adapter: null
n_positions: 1024
n_ctx: 1024
n_embd: 768
n_layer: 12
n_head: 12
n_inner: null
activation_function: gelu_new
resid_pdrop: 0.1
embd_pdrop: 0.1
attn_pdrop: 0.1
layer_norm_epsilon: 1.0e-05
scale_attn_weights: true
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:gpt2
): Name or path of the pretrained model.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.-
initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
n_positions
(default:1024
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). n_ctx
(default:1024
): Dimensionality of the causal mask (usually same as n_positions)n_embd
(default:768
): Dimensionality of the embeddings and hidden states.n_layer
(default:12
): Number of hidden layers in the Transformer encoder.n_head
(default:12
): Number of attention heads for each attention layer in the Transformer encoder.n_inner
(default:null
): Dimensionality of the inner feed-forward layers. None will set it to 4 times n_embdactivation_function
(default:gelu_new
): Activation function, to be selected in the list ['relu', 'silu', 'gelu', 'tanh', 'gelu_new']. Options:relu
,silu
,gelu
,tanh
,gelu_new
.resid_pdrop
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.embd_pdrop
(default:0.1
): The dropout ratio for the embeddings.attn_pdrop
(default:0.1
): The dropout ratio for the attention.layer_norm_epsilon
(default:1e-05
): The epsilon to use in the layer normalization layers.scale_attn_weights
(default:true
): Scale attention weights by dividing by sqrt(hidden_size). Options:true
,false
.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
Longformer¶
The longformer
encoder loads a pretrained Longformer (default allenai/longformer-base-4096
) model using the Hugging Face transformers package. Longformer is a good choice for longer text, as it supports sequences up to 4096 tokens long.
encoder:
type: longformer
use_pretrained: true
trainable: false
pretrained_model_name_or_path: allenai/longformer-base-4096
max_position_embeddings: 4098
reduce_output: cls_pooled
attention_window: 512
sep_token_id: 2
adapter: null
type_vocab_size: 1
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:allenai/longformer-base-4096
): Name or path of the pretrained model.max_position_embeddings
(default:4098
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).-
reduce_output
(default:cls_pooled
): The method used to reduce a sequence of tensors down to a single tensor. -
attention_window
(default:512
): Size of an attention window around each token. If an int, use the same size for all layers. To specify a different window size for each layer, use a List[int] where len(attention_window) == num_hidden_layers (see the sketch after this parameter list). -
sep_token_id
(default:2
): ID of the separator token, which is used when building a sequence from multiple sequences -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
type_vocab_size
(default:1
): The vocabulary size of the token_type_ids passed when calling LongformerEncoder pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
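As referenced in the attention_window parameter above, here is a minimal sketch of a per-layer window configuration; the window sizes are illustrative assumptions, and the list length matches the 12 hidden layers of allenai/longformer-base-4096:
encoder:
    type: longformer
    pretrained_model_name_or_path: allenai/longformer-base-4096
    attention_window: [64, 64, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512]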
RoBERTa¶
The roberta
encoder loads a pretrained RoBERTa (default roberta-base
) model using the Hugging Face transformers package. RoBERTa is a replication of BERT pretraining that may match or exceed the performance of BERT. It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.
encoder:
type: roberta
use_pretrained: true
trainable: false
pretrained_model_name_or_path: roberta-base
reduce_output: cls_pooled
eos_token_id: 2
adapter: null
pad_token_id: 1
bos_token_id: 0
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:roberta-base
): Name or path of the pretrained model.reduce_output
(default:cls_pooled
): The method used to reduce a sequence of tensors down to a single tensor.-
eos_token_id
(default:2
): The end of sequence token ID. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
pad_token_id
(default:1
): The ID of the token to use as padding. bos_token_id
(default:0
): The beginning of sequence token ID.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
T5¶
The t5
encoder loads a pretrained T5 (default t5-small
) model using the Hugging Face transformers package. T5 (Text-to-Text Transfer Transformer) is pre-trained on a huge text dataset crawled from the web and shows good transfer performance on multiple tasks.
encoder:
type: t5
use_pretrained: true
trainable: false
pretrained_model_name_or_path: t5-small
num_layers: 6
dropout_rate: 0.1
reduce_output: sum
d_ff: 2048
adapter: null
d_model: 512
d_kv: 64
num_decoder_layers: 6
num_heads: 8
relative_attention_num_buckets: 32
layer_norm_eps: 1.0e-06
initializer_factor: 1
feed_forward_proj: relu
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:t5-small
): Name or path of the pretrained model.num_layers
(default:6
): Number of hidden layers in the Transformer encoder.dropout_rate
(default:0.1
): The ratio for all dropout layers.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.-
d_ff
(default:2048
): Size of the intermediate feed forward layer in each T5Block. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
d_model
(default:512
): Size of the encoder layers and the pooler layer. d_kv
(default:64
): Size of the key, query, value projections per attention head. d_kv has to be equal to d_model // num_heads.num_decoder_layers
(default:6
): Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not set.num_heads
(default:8
): Number of attention heads for each attention layer in the Transformer encoder.relative_attention_num_buckets
(default:32
): The number of buckets to use for each attention layer.layer_norm_eps
(default:1e-06
): The epsilon used by the layer normalization layers.initializer_factor
(default:1
): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).feed_forward_proj
(default:relu
): Type of feed forward layer to be used. Should be one of 'relu' or 'gated-gelu'. T5 v1.1 uses the 'gated-gelu' feed forward projection, while the original T5 uses 'relu' (see the sketch after this parameter list). Options:relu
,gated-gelu
.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
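As referenced in the feed_forward_proj parameter above, here is a minimal sketch of pairing a T5 v1.1-style checkpoint with the gated-gelu projection; the model name google/t5-v1_1-small is an illustrative assumption:
encoder:
    type: t5
    use_pretrained: true
    pretrained_model_name_or_path: google/t5-v1_1-small
    feed_forward_proj: gated-gelu
    reduce_output: sum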
TransformerXL¶
The transformer_xl
encoder loads a pretrained Transformer-XL (default transfo-xl-wt103
) model using the Hugging Face transformers package. It adds a novel positional encoding scheme which improves understanding and generation of long-form text up to thousands of tokens. Transformer-XL is a causal (unidirectional) transformer with relative positioning (sinusoidal) embeddings which can reuse previously computed hidden states to attend to longer context (memory). This model also uses adaptive softmax inputs and outputs (tied).
encoder:
type: transformer_xl
use_pretrained: true
trainable: false
pretrained_model_name_or_path: transfo-xl-wt103
dropout: 0.1
reduce_output: sum
adaptive: true
adapter: null
cutoffs:
- 20000
- 40000
- 200000
d_model: 1024
d_embed: 1024
n_head: 16
d_head: 64
d_inner: 4096
div_val: 4
pre_lnorm: false
n_layer: 18
mem_len: 1600
clamp_len: 1000
same_length: true
proj_share_all_but_first: true
attn_type: 0
sample_softmax: -1
dropatt: 0.0
untie_r: true
init: normal
init_range: 0.01
proj_init_std: 0.01
init_std: 0.02
layer_norm_epsilon: 1.0e-05
eos_token_id: 0
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:transfo-xl-wt103
): Name or path of the pretrained model.dropout
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.-
adaptive
(default:true
): Whether or not to use adaptive softmax. Options:true
,false
. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
cutoffs
(default:[20000, 40000, 200000]
): Cutoffs for the adaptive softmax. d_model
(default:1024
): Dimensionality of the modelβs hidden states.d_embed
(default:1024
): Dimensionality of the embeddingsn_head
(default:16
): Number of attention heads for each attention layer in the Transformer encoder.d_head
(default:64
): Dimensionality of the modelβs heads.d_inner
(default:4096
): Inner dimension in FFdiv_val
(default:4
): Divisor value for the adaptive input and softmax.pre_lnorm
(default:false
): Whether or not to apply LayerNorm to the input instead of the output in the blocks. Options:true
,false
.n_layer
(default:18
): Number of hidden layers in the Transformer encoder.mem_len
(default:1600
): Length of the retained previous heads.clamp_len
(default:1000
): Use the same pos embeddings after clamp_len.same_length
(default:true
): Whether or not to use the same attn length for all tokens Options:true
,false
.proj_share_all_but_first
(default:true
): True to share all but first projs, False not to share. Options:true
,false
.attn_type
(default:0
): Attention type. 0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.sample_softmax
(default:-1
): Number of samples in the sampled softmax.dropatt
(default:0.0
): The dropout ratio for the attention probabilities.untie_r
(default:true
): Whether or not to untie relative position biases. Options:true
,false
.init
(default:normal
): Parameter initializer to use.init_range
(default:0.01
): Parameters initialized by U(-init_range, init_range).proj_init_std
(default:0.01
): Parameters initialized by N(0, init_std)init_std
(default:0.02
): Parameters initialized by N(0, init_std)layer_norm_epsilon
(default:1e-05
): The epsilon to use in the layer normalization layerseos_token_id
(default:0
): The end of sequence token ID.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
XLMRoBERTa¶
The xlmroberta
encoder loads a pretrained XLM-RoBERTa (default xlm-roberta-base
) model using the Hugging Face transformers package. XLM-RoBERTa is a multi-language model similar to BERT, trained on 100 languages. XLM-RoBERTa is based on Facebookβs RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
encoder:
type: xlmroberta
use_pretrained: true
trainable: false
pretrained_model_name_or_path: xlm-roberta-base
reduce_output: cls_pooled
max_position_embeddings: 514
type_vocab_size: 1
adapter: null
pad_token_id: 1
bos_token_id: 0
eos_token_id: 2
add_pooling_layer: true
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:xlm-roberta-base
): Name or path of the pretrained model.reduce_output
(default:cls_pooled
): The method used to reduce a sequence of tensors down to a single tensor.max_position_embeddings
(default:514
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).-
type_vocab_size
(default:1
): The vocabulary size of the token_type_ids passed in. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
pad_token_id
(default:1
): The ID of the token to use as padding. bos_token_id
(default:0
): The beginning of sequence token ID.eos_token_id
(default:2
): The end of sequence token ID.add_pooling_layer
(default:true
): Whether to add a pooling layer to the encoder. Options:true
,false
.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
XLNet¶
The xlnet
encoder loads a pretrained XLNet (default xlnet-base-cased
) model using the Hugging Face transformers package. XLNet is an extension of the Transformer-XL model pre-trained using an autoregressive method to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization order. XLNet outperforms BERT on a variety of benchmarks.
encoder:
type: xlnet
use_pretrained: true
trainable: false
pretrained_model_name_or_path: xlnet-base-cased
dropout: 0.1
reduce_output: sum
ff_activation: gelu
initializer_range: 0.02
summary_activation: tanh
summary_last_dropout: 0.1
adapter: null
d_model: 768
n_layer: 12
n_head: 12
d_inner: 3072
untie_r: true
attn_type: bi
layer_norm_eps: 1.0e-12
mem_len: null
reuse_len: null
use_mems_eval: true
use_mems_train: false
bi_data: false
clamp_len: -1
same_length: false
summary_type: last
summary_use_proj: true
start_n_top: 5
end_n_top: 5
pad_token_id: 5
bos_token_id: 1
eos_token_id: 2
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:xlnet-base-cased
): Name or path of the pretrained model.dropout
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.ff_activation
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. If string, 'gelu', 'relu', 'silu' and 'gelu_new' are supported. Options:gelu
,relu
,silu
,gelu_new
.initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.summary_activation
(default:tanh
): Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.-
summary_last_dropout
(default:0.1
): Used in the sequence classification and multiple choice models. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
d_model
(default:768
): Dimensionality of the encoder layers and the pooler layer. n_layer
(default:12
): Number of hidden layers in the Transformer encoder.n_head
(default:12
): Number of attention heads for each attention layer in the Transformer encoder.d_inner
(default:3072
): Dimensionality of the βintermediateβ (often named feed-forward) layer in the Transformer encoder.untie_r
(default:true
): Whether or not to untie relative position biases Options:true
,false
.attn_type
(default:bi
): The attention type used by the model. Currently only 'bi' is supported. Options:bi
.layer_norm_eps
(default:1e-12
): The epsilon used by the layer normalization layers.mem_len
(default:null
): The number of tokens to cache. The key/value pairs that have already been pre-computed in a previous forward pass wonβt be re-computed.reuse_len
(default:null
): The number of tokens in the current batch to be cached and reused in the future.use_mems_eval
(default:true
): Whether or not the model should make use of the recurrent memory mechanism in evaluation mode. Options:true
,false
.use_mems_train
(default:false
): Whether or not the model should make use of the recurrent memory mechanism in train mode. Options:true
,false
.bi_data
(default:false
): Whether or not to use bidirectional input pipeline. Usually set to True during pretraining and False during finetuning. Options:true
,false
.clamp_len
(default:-1
): Clamp all relative distances larger than clamp_len. Setting this attribute to -1 means no clamping.same_length
(default:false
): Whether or not to use the same attention length for each token. Options:true
,false
.summary_type
(default:last
): Argument used when doing sequence summary. Used in the sequence classification and multiple choice models. Options:last
,first
,mean
,cls_index
,attn
.summary_use_proj
(default:true
): Options:true
,false
.start_n_top
(default:5
): Used in the SQuAD evaluation script.end_n_top
(default:5
): Used in the SQuAD evaluation script.pad_token_id
(default:5
): The ID of the token to use as padding.bos_token_id
(default:1
): The beginning of sequence token ID.eos_token_id
(default:2
): The end of sequence token ID.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
LLM Encoders¶
Diagram: token IDs → Pretrained LLM → Last Hidden State → (downstream combiner).
The LLM encoder processes text with a pretrained LLM (e.g. llama-2-7b
) and passes the last hidden state of the LLM forward to the combiner. Like the LLM model type, adapter-based fine-tuning and quantization can be configured, and any combiner or decoder parameters will be bundled with the adapter weights.
Example config:
encoder:
type: llm
base_model: meta-llama/Llama-2-7b-hf
adapter:
type: lora
quantization:
bits: 4
Parameters:
Base Model¶
The base_model
parameter specifies the pretrained large language model to serve
as the foundation of your custom LLM.
More information about the base_model
parameter can be found here
Adapter¶
LoRA¶
LoRA is a simple, yet effective, method for parameter-efficient fine-tuning of pretrained language models. It works by adding a small number of trainable parameters to the model, which are used to adapt the pretrained parameters to the downstream task. This allows the model to be fine-tuned with a much smaller number of training examples, and can even be used to fine-tune models on tasks that have no training data available at all.
adapter:
type: lora
r: 8
dropout: 0.05
target_modules: null
use_rslora: false
use_dora: false
alpha: 16
pretrained_adapter_weights: null
postprocessor:
merge_adapter_into_base_model: false
progressbar: false
bias_type: none
r
(default:8
) : Lora attention dimension.dropout
(default:0.05
): The dropout probability for Lora layers.target_modules
(default:null
): List of module names or regex expression of the module names to replace with LoRA. For example, ['q', 'v'] or '.decoder.(SelfAttention|EncDecAttention).*(q|v)$'. Defaults to targeting the query and value matrices of all self-attention and encoder-decoder attention layers.use_rslora
(default:false
): When set to True, uses Rank-Stabilized LoRA which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it was proven to work better. Otherwise, it will use the original default value of lora_alpha/r. Paper: https://arxiv.org/abs/2312.03732. Options:true
,false
.use_dora
(default:false
): Enable 'Weight-Decomposed Low-Rank Adaptation' (DoRA). This technique decomposes the updates of the weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. Right now, DoRA only supports non-quantized linear layers. DoRA introduces a bigger overhead than pure LoRA, so it is recommended to merge weights for inference. For more information, see https://arxiv.org/abs/2402.09353 Options:true
,false
.alpha
(default:null
): The alpha parameter for Lora scaling. Defaults to2 * r
.pretrained_adapter_weights
(default:null
): Path to pretrained weights.postprocessor
:postprocessor.merge_adapter_into_base_model
(default:false
): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately). Options:true
,false
.postprocessor.progressbar
(default:false
): Instructs whether or not to show a progress bar indicating the unload and merge process. Options:true
,false
.bias_type
(default:none
): Bias type for Lora. Options:none
,all
,lora_only
.
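As a sketch, the LoRA adapter can be combined with the LLM encoder and restricted to specific modules; the module names q_proj and v_proj are an illustrative assumption for a LLaMA-style model, not universal defaults:
encoder:
    type: llm
    base_model: meta-llama/Llama-2-7b-hf
    adapter:
        type: lora
        r: 16
        alpha: 32
        dropout: 0.05
        target_modules:
            - q_proj
            - v_proj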
AdaLoRA¶
AdaLoRA is an extension of LoRA that allows the model to adapt the pretrained parameters to the downstream task in a task-specific manner. This is done by adding a small number of trainable parameters to the model, which are used to adapt the pretrained parameters to the downstream task. This allows the model to be fine-tuned with a much smaller number of training examples, and can even be used to fine-tune models on tasks that have no training data available at all.
adapter:
type: adalora
r: 8
dropout: 0.05
target_modules: null
use_rslora: false
use_dora: false
alpha: 16
pretrained_adapter_weights: null
postprocessor:
merge_adapter_into_base_model: false
progressbar: false
bias_type: none
target_r: 8
init_r: 12
tinit: 0
tfinal: 0
delta_t: 1
beta1: 0.85
beta2: 0.85
orth_reg_weight: 0.5
total_step: null
rank_pattern: null
r
(default:8
) : Lora attention dimension.dropout
(default:0.05
): The dropout probability for Lora layers.target_modules
(default:null
): List of module names or regex expression of the module names to replace with LoRA. For example, ['q', 'v'] or '.decoder.(SelfAttention|EncDecAttention).*(q|v)$'. Defaults to targeting the query and value matrices of all self-attention and encoder-decoder attention layers.use_rslora
(default:false
): When set to True, uses Rank-Stabilized LoRA which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it was proven to work better. Otherwise, it will use the original default value of lora_alpha/r. Paper: https://arxiv.org/abs/2312.03732. Options:true
,false
.use_dora
(default:false
): Enable 'Weight-Decomposed Low-Rank Adaptation' (DoRA). This technique decomposes the updates of the weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. Right now, DoRA only supports non-quantized linear layers. DoRA introduces a bigger overhead than pure LoRA, so it is recommended to merge weights for inference. For more information, see https://arxiv.org/abs/2402.09353 Options:true
,false
.alpha
(default:null
): The alpha parameter for Lora scaling. Defaults to2 * r
.pretrained_adapter_weights
(default:null
): Path to pretrained weights.postprocessor
:postprocessor.merge_adapter_into_base_model
(default:false
): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately). Options:true
,false
.postprocessor.progressbar
(default:false
): Instructs whether or not to show a progress bar indicating the unload and merge process. Options:true
,false
.bias_type
(default:none
): Bias type for Lora. Options:none
,all
,lora_only
.target_r
(default:8
): Target Lora Matrix Dimension. The target average rank of the incremental matrices.init_r
(default:12
): Initial Lora Matrix Dimension. The initial rank for each incremental matrix.tinit
(default:0
): The steps of initial fine-tuning warmup.tfinal
(default:0
): The steps of final fine-tuning warmup.delta_t
(default:1
): The time interval between two budget allocations, i.e. the step interval of rank allocation.beta1
(default:0.85
): The hyperparameter of EMA for sensitivity smoothing.beta2
(default:0.85
): The hyperparameter of EMA for uncertainty quantification.orth_reg_weight
(default:0.5
): The coefficient of orthogonality regularization.total_step
(default:null
): The total training steps that should be specified before training.rank_pattern
(default:null
): The allocated rank for each weight matrix by RankAllocator.
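As a sketch, an AdaLoRA configuration that starts each incremental matrix at rank 12 and prunes down to an average target rank of 4 over training might look like the following; the step counts are illustrative and total_step should match the actual number of training steps.
adapter:
  type: adalora
  init_r: 12        # initial rank of each incremental matrix
  target_r: 4       # target average rank after budget allocation
  tinit: 200        # steps of initial fine-tuning warmup before pruning begins
  tfinal: 500       # steps of final fine-tuning warmup
  delta_t: 10       # reallocate the rank budget every 10 steps
  total_step: 5000  # total number of training steps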
IA3¶
Infused Adapter by Inhibiting and Amplifying Inner Activations, or IA3,
is a method that adds three learned vectors l_k, l_v, and l_ff, to rescale the keys and values of the self-attention and encoder-decoder attention layers, and the intermediate activation of the position-wise feed-forward network respectively. These learned vectors are the only trainable parameters during fine-tuning, and thus the original weights remain frozen. Dealing with learned vectors (as opposed to learned low-rank updates to a weight matrix like LoRA) keeps the number of trainable parameters much smaller.
adapter:
  type: ia3
  target_modules: null
  feedforward_modules: null
  fan_in_fan_out: false
  modules_to_save: null
  init_ia3_weights: true
  pretrained_adapter_weights: null
  postprocessor:
    merge_adapter_into_base_model: false
    progressbar: false
target_modules
(default:null
) : The names of the modules to apply (IA)^3 to.feedforward_modules
(default:null
) : The names of the modules to be treated as feedforward modules, as in the original paper. These modules will have (IA)^3 vectors multiplied to the input, instead of the output. feedforward_modules must be a name or a subset of names present in target_modules.fan_in_fan_out
(default:false
) : Set this to True if the layer to replace stores weight like (fan_in, fan_out). For example, gpt-2 uses Conv1D which stores weights like (fan_in, fan_out) and hence this should be set to True. Options:true
,false
.modules_to_save
(default:null
) : List of modules apart from (IA)^3 layers to be set as trainable and saved in the final checkpoint.init_ia3_weights
(default:true
) : Whether to initialize the vectors in the (IA)^3 layers, defaults to True. Options:true
,false
.pretrained_adapter_weights
(default:null
): Path to pretrained weights.postprocessor
:postprocessor.merge_adapter_into_base_model
(default:false
): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately). Options:true
,false
.postprocessor.progressbar
(default:false
): Instructs whether or not to show a progress bar indicating the unload and merge process. Options:true
,false
.
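For example, an IA3 adapter that explicitly names the attention and feed-forward projections to rescale could be configured as in the sketch below; the module names are assumptions that depend on the architecture of the base model being fine-tuned.
adapter:
  type: ia3
  target_modules: [k_proj, v_proj, down_proj]   # hypothetical module names
  feedforward_modules: [down_proj]              # must be a subset of target_modules
  init_ia3_weights: true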
More information about the adapter config can be found here.
Quantization¶
Attention
Quantized fine-tuning currently requires using adapter: lora
. In-context
learning does not have this restriction.
Attention
Quantization is currently only supported with backend: local
.
quantization:
  bits: 4
  llm_int8_threshold: 6.0
  llm_int8_has_fp16_weight: false
  bnb_4bit_compute_dtype: float16
  bnb_4bit_use_double_quant: true
  bnb_4bit_quant_type: nf4
bits
(default:4
) : The quantization level to apply to weights on load. Options:4
,8
.llm_int8_threshold
(default:6.0
): This corresponds to the outlier threshold for outlier detection as described in the LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
paper: https://arxiv.org/abs/2208.07339. Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).llm_int8_has_fp16_weight
(default:false
): This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not have to be converted back and forth for the backward pass. Options:true
,false
.bnb_4bit_compute_dtype
(default:float16
): This sets the computational type which might be different than the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups. Options:float32
,float16
,bfloat16
.bnb_4bit_use_double_quant
(default:true
): This flag is used for nested quantization where the quantization constants from the first quantization are quantized again. Options:true
,false
.bnb_4bit_quant_type
(default:nf4
): This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options:fp4
,nf4
.
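Combining this with the adapter section above, a 4-bit QLoRA-style fine-tuning setup on the local backend might look like the following sketch; the choice of bfloat16 compute assumes a GPU that supports it (otherwise float16 can be used).
backend:
  type: local
adapter:
  type: lora
quantization:
  bits: 4
  bnb_4bit_compute_dtype: bfloat16   # assumes bf16-capable hardware
  bnb_4bit_use_double_quant: true
  bnb_4bit_quant_type: nf4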
More information about quantization parameters can be found here.
Model Parameters¶
More information about the model initialization parameters can be found here.
Output Features¶
Text output features are a special case of Sequence Features, so all options of sequence features are available for text features as well.
Text output features can be used for either tagging (classifying each token of an input sequence) or text
generation (generating text by repeatedly sampling from the model). There are two decoders available for these tasks
named tagger
and generator
respectively.
Example text output feature using default parameters:
name: text_column_name
type: text
reduce_input: null
dependencies: []
reduce_dependencies: sum
loss:
  type: softmax_cross_entropy
  confidence_penalty: 0
  robust_lambda: 0
  class_weights: 1
  class_similarities_temperature: 0
decoder:
  type: generator
Parameters:
reduce_input
(default:sum
): defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are:sum
,mean
oravg
,max
,concat
(concatenates along the sequence dimension),last
(returns the last vector of the sequence dimension).dependencies
(default:[]
): the output features this one is dependent on. For a detailed explanation refer to Output Feature Dependencies.reduce_dependencies
(default:sum
): defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are:sum
,mean
oravg
,max
,concat
(concatenates along the sequence dimension),last
(returns the last vector of the sequence dimension).loss
(default:{type: softmax_cross_entropy, class_similarities_temperature: 0, class_weights: 1, confidence_penalty: 0, robust_lambda: 0}
): is a dictionary containing a loss type. The only available loss type for text features is softmax_cross_entropy
. See Loss for details.decoder
(default:{"type": "generator"}
): Decoder for the desired task. Options:generator
,tagger
. See Decoder for details.
Decoder type and decoder parameters can also be defined once and applied to all text output features using the Type-Global Decoder section. Loss and loss-related parameters can also be defined once in the same way.
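As a further sketch, a text output feature configured for tagging rather than generation might look like the following; reduce_input is left as null so the decoder receives the full sequence it requires (see the Tagger section below).
name: text_column_name
type: text
reduce_input: null   # the tagger needs the full b x s x h tensor, not a reduced vector
decoder:
  type: tagger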
Decoders¶
Generator¶
graph LR
A["Combiner Output"] --> B["Fully\n Connected\n Layers"];
B --> C1["RNN"] --> C2["RNN"] --> C3["RNN"];
GO(["GO"]) -.-o C1;
C1 -.-o O1("Output");
O1 -.-o C2;
C2 -.-o O2("Output");
O2 -.-o C3;
C3 -.-o END(["END"]);
subgraph DEC["DECODER.."]
B
C1
C2
C3
end
In the case of generator
the decoder is a (potentially empty) stack of fully connected layers, followed by an RNN that
generates outputs feeding on its own previous predictions and generates a tensor of size b x s' x c
, where b
is the
batch size, s'
is the length of the generated sequence and c
is the number of classes, followed by a
softmax_cross_entropy.
During training, teacher forcing is adopted: the list of targets is provided as both inputs and outputs (shifted by 1). At evaluation time, decoding is performed by beam search with a beam size of 1 by default, which amounts to greedy decoding (generating one token at a time and feeding it back as input for the next step).
In general a generator expects a b x h
shaped input tensor, where h
is a hidden dimension.
The h
vectors are (after an optional stack of fully connected layers) fed into the rnn generator.
One exception is when the generator uses attention, as in that case the expected size of the input tensor is
b x s x h
, which is the output of a sequence, text or time series input feature without reduced outputs or the output
of a sequence-based combiner.
If a b x h
input is provided to a generator decoder using an RNN with attention instead, an error will be raised
during model building.
decoder:
  type: generator
  num_fc_layers: 0
  fc_output_size: 256
  fc_norm: null
  fc_dropout: 0.0
  cell_type: gru
  num_layers: 1
  fc_activation: relu
  reduce_input: sum
  fc_layers: null
  fc_use_bias: true
  fc_weights_initializer: xavier_uniform
  fc_bias_initializer: zeros
  fc_norm_params: null
Parameters:
num_fc_layers
(default:0
) : Number of fully-connected layers if fc_layers
not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.fc_output_size
(default:256
) : Output size of fully connected stack.fc_norm
(default:null
) : Default normalization applied at the beginning of fully connected layers. Options:batch
,layer
,ghost
,null
.fc_dropout
(default:0.0
) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).cell_type
(default:gru
) : Type of recurrent cell to use. Options:rnn
,lstm
,gru
.num_layers
(default:1
) : The number of stacked recurrent layers.fc_activation
(default:relu
): Default activation function applied to the output of the fully connected layers. Options:elu
,leakyRelu
,logSigmoid
,relu
,sigmoid
,tanh
,softmax
,null
.reduce_input
(default:sum
): How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Options:sum
,mean
,avg
,max
,concat
,last
.fc_layers
(default:null
): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are:activation
,dropout
,norm
,norm_params
,output_size
,use_bias
,bias_initializer
and weights_initializer
. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.fc_use_bias
(default:true
): Whether the layer uses a bias vector in the fc_stack. Options:true
,false
.fc_weights_initializer
(default:xavier_uniform
): The weights initializer to use for the layers in the fc_stack.fc_bias_initializer
(default:zeros
): The bias initializer to use for the layers in the fc_stack.fc_norm_params
(default:null
): Default parameters passed to the norm
module.
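For instance, a generator decoder with a deeper recurrent stack and a small fully connected stack in front of it could be configured as in this sketch; the values are illustrative rather than recommended defaults.
decoder:
  type: generator
  cell_type: lstm     # rnn, lstm, or gru
  num_layers: 2       # two stacked recurrent layers
  num_fc_layers: 1
  fc_output_size: 512
  fc_dropout: 0.1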
Tagger¶
graph LR
A["emb[0]\n....\nemb[n]"] --> B["Fully\n Connected\n Layers"];
B --> C["Projection\n....\nProjection"];
C --> D["Softmax\n....\nSoftmax"];
subgraph DEC["DECODER.."]
B
C
D
end
subgraph COM["COMBINER OUT.."]
A
end
In the case of tagger
the decoder is a (potentially empty) stack of fully connected layers, followed by a projection
into a tensor of size b x s x c
, where b
is the batch size, s
is the length of the sequence and c
is the number
of classes, followed by a softmax_cross_entropy.
This decoder requires its input to be shaped as b x s x h
, where h
is a hidden dimension, which is the output of a
sequence, text or time series input feature without reduced outputs or the output of a sequence-based combiner.
If a b x h
input is provided instead, an error will be raised during model building.
decoder:
  type: tagger
  num_fc_layers: 0
  fc_output_size: 256
  fc_norm: null
  fc_dropout: 0.0
  fc_activation: relu
  attention_embedding_size: 256
  fc_layers: null
  fc_use_bias: true
  fc_weights_initializer: xavier_uniform
  fc_bias_initializer: zeros
  fc_norm_params: null
  use_attention: false
  use_bias: true
  attention_num_heads: 8
Parameters:
num_fc_layers
(default:0
) : Number of fully-connected layers if fc_layers
not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.fc_output_size
(default:256
) : Output size of fully connected stack.fc_norm
(default:null
) : Default normalization applied at the beginning of fully connected layers. Options:batch
,layer
,ghost
,null
.fc_dropout
(default:0.0
) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).fc_activation
(default:relu
): Default activation function applied to the output of the fully connected layers. Options:elu
,leakyRelu
,logSigmoid
,relu
,sigmoid
,tanh
,softmax
,null
.attention_embedding_size
(default:256
): The embedding size of the multi-head self attention layer.fc_layers
(default:null
): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are:activation
,dropout
,norm
,norm_params
,output_size
,use_bias
,bias_initializer
and weights_initializer
. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.fc_use_bias
(default:true
): Whether the layer uses a bias vector in the fc_stack. Options:true
,false
.fc_weights_initializer
(default:xavier_uniform
): The weights initializer to use for the layers in the fc_stack.fc_bias_initializer
(default:zeros
): The bias initializer to use for the layers in the fc_stack.fc_norm_params
(default:null
): Default parameters passed to the norm
module.use_attention
(default:false
): Whether to apply a multi-head self attention layer before prediction. Options:true
,false
.use_bias
(default:true
): Whether the layer uses a bias vector. Options:true
,false
.attention_num_heads
(default:8
): The number of attention heads in the multi-head self attention layer.
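For example, a tagger decoder that enables the optional multi-head self attention layer before the projection might be configured as in the sketch below; the head count and embedding size are illustrative.
decoder:
  type: tagger
  use_attention: true
  attention_num_heads: 4
  attention_embedding_size: 128
  fc_dropout: 0.1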
Loss¶
Sequence Softmax Cross Entropy¶
loss:
  type: sequence_softmax_cross_entropy
  class_weights: null
  weight: 1.0
  robust_lambda: 0
  confidence_penalty: 0
  class_similarities: null
  class_similarities_temperature: 0
  unique: false
Parameters:
class_weights
(default:null
): Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied by the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the <UNK> class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like {class_a: 0.5, class_b: 0.7, ...}
.weight
(default:1.0
): Weight of the loss.robust_lambda
(default:0
): Replaces the loss with (1 - robust_lambda) * loss + robust_lambda / c, where c
is the number of classes. Useful in case of noisy labels.confidence_penalty
(default:0
): Penalizes overconfident predictions (low entropy) by adding an a * (max_entropy - entropy) / max_entropy
term to the loss, where a is the value of this parameter. Useful in case of noisy labels.class_similarities
(default:null
): If not null, it is a c x c matrix in the form of a list of lists that contains the mutual similarity of classes. It is used if class_similarities_temperature
is greater than 0. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the <UNK>
class needs to be included too).class_similarities_temperature
(default:0
): The temperature parameter of the softmax that is performed on each row of class_similarities
. The output of that softmax is used to determine the supervision vector to provide instead of the one hot vector that would be provided otherwise for each datapoint. The intuition behind it is that errors between similar classes are more tolerable than errors between really different classes.unique
(default:false
): If true, the loss is only computed for unique elements in the sequence. Options:true
,false
.
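As an example, a loss that up-weights a rare class and softens noisy labels slightly might look like the sketch below, where class_a and class_b are hypothetical class names from the feature's vocabulary.
loss:
  type: sequence_softmax_cross_entropy
  class_weights: {class_a: 2.0, class_b: 1.0}   # hypothetical class names and weights
  robust_lambda: 0.1        # mixes in a small uniform term to soften noisy labels
  confidence_penalty: 0.1   # penalizes overconfident (low-entropy) predictions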
Metrics¶
The metrics available for text features are the same as for Sequence Features:
sequence_accuracy
The rate at which the model predicted the correct sequence.token_accuracy
The number of tokens correctly predicted divided by the total number of tokens in all sequences.last_accuracy
Accuracy considering only the last element of the sequence. Useful to ensure special end-of-sequence tokens are generated or tagged.edit_distance
Levenshtein distance: the minimum number of single-token edits (insertions, deletions or substitutions) required to change the predicted sequence into the ground truth.perplexity
Perplexity is the inverse of the predicted probability of the ground truth sequence, normalized by the number of tokens. The lower the perplexity, the higher the probability of predicting the true sequence.loss
The value of the loss function.
You can set any of the above as validation_metric
in the training
section of the configuration if validation_field
names a sequence feature.
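For example, to select model checkpoints by token accuracy on a text output feature, the configuration could include something like the sketch below; this assumes the trainer section (referred to as the training section above) and a text output feature named text_column_name.
trainer:
  validation_field: text_column_name
  validation_metric: token_accuracy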