⇅ Text Features

Preprocessing

Text features are an extension of sequence features. Text inputs are processed by a tokenizer which maps the raw text input into a sequence of tokens. An integer id is assigned to each unique token. Using this mapping, each text string is converted first to a sequence of tokens, and next to a sequence of integers.

The list of tokens and their integer representations (vocabulary) is stored in the metadata of the model. In the case of a text output feature, this same mapping is used to post-process predictions to text.
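
The text-to-integers mapping described above can be sketched in a few lines of Python. This is only an illustration, not the library's implementation: the helper names build_vocab and encode are hypothetical, whitespace splitting stands in for the real tokenizer, and the choice of ids 0 and 1 for the padding and unknown symbols is an assumption.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(texts, most_common=20000):
    """Assign an integer id to each token, keeping the most frequent ones."""
    counts = Counter(tok for text in texts for tok in text.split())
    # Assumption: special symbols take the first two ids.
    tokens = [PAD, UNK] + [tok for tok, _ in counts.most_common(most_common)]
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(text, vocab, max_sequence_length=256):
    """Convert text to token ids, truncating and right-padding to a fixed length."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in text.split()]
    ids = ids[:max_sequence_length]
    return ids + [vocab[PAD]] * (max_sequence_length - len(ids))


vocab = build_vocab(["the cat sat", "the dog sat down"])
# "bird" never appeared in the training texts, so it maps to the <UNK> id.
encoded = encode("the bird sat", vocab, max_sequence_length=5)
```

The same vocab dictionary, inverted, is what would turn predicted ids back into text for an output feature.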

preprocessing:
    pretrained_model_name_or_path: null
    tokenizer: space_punct
    vocab_file: null
    sequence_length: null
    max_sequence_length: 256
    most_common: 20000
    padding_symbol: <PAD>
    unknown_symbol: <UNK>
    padding: right
    lowercase: false
    missing_value_strategy: fill_with_const
    fill_value: <UNK>
    computed_fill_value: <UNK>
    ngram_size: 2
    cache_encoder_embeddings: false
    compute_idf: false
    prompt:
        template: null
        task: null
        retrieval:
            type: null
            index_name: null
            model_name: null
            k: 0

Parameters:

pretrained_model_name_or_path (default: null):
tokenizer (default: space_punct):
vocab_file (default: null):
sequence_length (default: null):
max_sequence_length (default: 256):
most_common (default: 20000):
padding_symbol (default: <PAD>):
unknown_symbol (default: <UNK>):
padding (default: right):
lowercase (default: false):
missing_value_strategy (default: fill_with_const): See Missing Value Strategy for details.
fill_value (default: <UNK>):
computed_fill_value (default: <UNK>):
ngram_size (default: 2):
cache_encoder_embeddings (default: false):
compute_idf (default: false):
prompt (default: null):
prompt.template (default: null) : The template to use for the prompt. Must contain at least one of the columns from the input dataset or __sample__ as a variable surrounded in curly brackets {} to indicate where to insert the current feature. Multiple columns can be inserted, e.g.: The {color} {animal} jumped over the {size} {object}, where every term in curly brackets is a column in the dataset. If a task is specified, then the template must also contain the __task__ variable. If retrieval is specified, then the template must also contain the __context__ variable. If no template is provided, then a default will be used based on the retrieval settings, and a task must be set in the config.
prompt.task (default: null) : The task to use for the prompt. Required if template is not set.
prompt.retrieval (default: null):
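
To make the template rules for prompt.template more concrete, here is a minimal sketch of how a template with curly-bracket variables could be filled from one dataset row. The functions template_fields and render_prompt are hypothetical helpers written for this example and are not part of the library; only the variable names __task__ and __context__ come from the description above.

```python
import string


def template_fields(template):
    """Return the variable names that appear in curly brackets."""
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]


def render_prompt(template, row, task=None, context=None):
    """Fill a template from one dataset row (a dict of column name -> value)."""
    values = dict(row)
    if task is not None:
        values["__task__"] = task
    if context is not None:
        values["__context__"] = context
    return template.format(**values)


template = "{__task__}: The {color} {animal} jumped over the {size} {object}"
row = {"color": "brown", "animal": "fox", "size": "small", "object": "fence"}
prompt = render_prompt(template, row, task="Classify the sentence")
```

Listing the fields first with template_fields is a cheap way to check that every variable in the template is either a dataset column or one of the reserved names before any rendering happens.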

Preprocessing parameters can also be defined once and applied to all text input features using the Type-Global Preprocessing section.

Note

If a text feature's encoder specifies a Hugging Face model, then the tokenizer for that model will be used automatically.

Input Features

The encoder parameters specified at the feature level are:

tied (default null): name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.

Example text feature entry in the input features list:

name: text_column_name
type: text
tied: null
encoder:
    type: bert
    trainable: true

Parameters:

type (default parallel_cnn): encoder to use for the input text feature. The available encoders include the encoders used for Sequence Features as well as pre-trained text encoders from the Hugging Face transformers library: albert, auto_transformer, bert, camembert, distilbert, electra, gpt, gpt2, longformer, modernbert, roberta, t5, mt5, xlm, xlmroberta, xlnet.

Encoder type and encoder parameters can also be defined once and applied to all text input features using the Type-Global Encoder section.

Encoders

Embed Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C["Aggregation\n Reduce\n Operation"];
  C --> ...;

The embed encoder simply maps each token in the input sequence to an embedding, creating a b x s x h tensor where b is the batch size, s is the length of the sequence and h is the embedding size. The tensor is reduced along the s dimension to obtain a single vector of size h for each element of the batch. If you want to output the full b x s x h tensor, you can specify reduce_output: null.
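
The shape bookkeeping can be illustrated with plain Python lists. This is a sketch under simple assumptions (a randomly initialized lookup table, sum as the reduction), and the helper names embed and reduce_sum are hypothetical rather than library functions.

```python
import random


def embed(batch_ids, table):
    """Look up one embedding per token id: b x s -> b x s x h."""
    return [[table[i] for i in seq] for seq in batch_ids]


def reduce_sum(embedded):
    """Sum over the sequence dimension: b x s x h -> b x h."""
    return [[sum(col) for col in zip(*seq)] for seq in embedded]


random.seed(0)
vocab_size, h = 10, 4
table = [[random.uniform(-1, 1) for _ in range(h)] for _ in range(vocab_size)]

batch = [[1, 7, 4], [2, 2, 0]]   # b = 2 sequences, s = 3 token ids each
full = embed(batch, table)       # 2 x 3 x 4, what reduce_output: null would keep
reduced = reduce_sum(full)       # 2 x 4, one vector per batch element
```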

encoder:
    type: embed
    dropout: 0.0
    embedding_size: 256
    representation: dense
    weights_initializer: uniform
    reduce_output: sum
    embeddings_on_cpu: false
    embeddings_trainable: true
    skip: false
    adapter: null
    pretrained_embeddings: null

Parameters:

dropout (default: 0.0) : Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
weights_initializer (default: uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory.
embeddings_trainable (default: true): If true embeddings are trained during the training process, if false embeddings are fixed. It may be useful when loading pretrained embeddings for avoiding finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable.
skip (default: false):
adapter (default: null):
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter lets you specify a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
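
As a small illustration of the embedding_size rule in the list above, the following sketch computes the effective embedding size for both representations. The function name effective_embedding_size is hypothetical and exists only for this example.

```python
def effective_embedding_size(vocabulary_size, embedding_size=256, representation="dense"):
    """Apply the rule: min(vocabulary_size, embedding_size) if dense, vocabulary_size if sparse."""
    if representation == "dense":
        return min(vocabulary_size, embedding_size)
    if representation == "sparse":
        return vocabulary_size
    raise ValueError(f"unknown representation: {representation!r}")


small_vocab = effective_embedding_size(50)                              # 50
large_vocab = effective_embedding_size(20004)                           # 256
one_hot = effective_embedding_size(20004, representation="sparse")      # 20004
```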

Parallel CNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  C --> D1["1D Conv\n Width 2"] --> E1["Pool"];
  C --> D2["1D Conv\n Width 3"] --> E2["Pool"];
  C --> D3["1D Conv\n Width 4"] --> E3["Pool"];
  C --> D4["1D Conv\n Width 5"] --> E4["Pool"];
  E1 --> F["Concat"] --> G["Fully\n Connected\n Layers"] --> H["..."];
  E2 --> F;
  E3 --> F;
  E4 --> F;

The parallel cnn encoder is inspired by Yoon Kim's Convolutional Neural Networks for Sentence Classification. It first maps the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then passes the embeddings through a number of parallel 1d convolutional layers with different filter sizes (by default 4 layers with filter sizes 2, 3, 4 and 5), followed by max pooling and concatenation. The single vector concatenating the outputs of the parallel convolutional layers is then passed through a stack of fully connected layers and returned as a b x h tensor, where h is the output size of the last fully connected layer. If you want to output the full b x s x h tensor, you can specify reduce_output: null.
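
The convolve, pool, and concatenate steps can be sketched for a single sequence in plain Python. This is a simplified illustration, not the library's code: it assumes "valid" convolutions with no bias or activation, omits the fully connected layers, and the helper names conv1d_valid, max_over_time, and parallel_cnn are hypothetical.

```python
import random


def conv1d_valid(seq, filters):
    """seq: s x h, filters: f x w x h -> feature map of shape (s - w + 1) x f."""
    w = len(filters[0])
    h = len(seq[0])
    out = []
    for start in range(len(seq) - w + 1):
        window = seq[start:start + w]
        out.append([
            sum(k[i][j] * window[i][j] for i in range(w) for j in range(h))
            for k in filters
        ])
    return out


def max_over_time(feature_map):
    """Max-pool over the sequence dimension: positions x f -> f."""
    return [max(col) for col in zip(*feature_map)]


def parallel_cnn(seq, branches):
    """Run every branch on the same input and concatenate the pooled outputs."""
    pooled = []
    for filters in branches:
        pooled.extend(max_over_time(conv1d_valid(seq, filters)))
    return pooled


rng = random.Random(0)
s, h, num_filters = 7, 3, 2
seq = [[rng.uniform(-1, 1) for _ in range(h)] for _ in range(s)]
# One branch per filter width, mirroring the default widths 2, 3, 4 and 5.
branches = [
    [[[rng.uniform(-1, 1) for _ in range(h)] for _ in range(w)] for _ in range(num_filters)]
    for w in (2, 3, 4, 5)
]
vector = parallel_cnn(seq, branches)   # length = 4 branches * 2 filters = 8
```

Because each branch is pooled over the whole sequence before concatenation, the length of the resulting vector depends only on the number of branches and filters, not on s.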

encoder:
    type: parallel_cnn
    dropout: 0.0
    embedding_size: 256
    num_conv_layers: null
    output_size: 256
    activation: relu
    filter_size: 3
    norm: null
    representation: dense
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: sum
    conv_layers: null
    pool_function: max
    pool_size: null
    norm_params: null
    num_fc_layers: null
    fc_layers: null
    skip: false
    adapter: null
    pretrained_embeddings: null
    num_filters: 256

Parameters:

dropout (default: 0.0) : Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
num_conv_layers (default: null) : The number of stacked convolutional layers when conv_layers is null.
output_size (default: 256) : The default output_size that will be used for each layer.
activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
use_bias (default: true): Whether to use a bias vector.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory.
embeddings_trainable (default: true): If true embeddings are trained during the training process, if false embeddings are fixed. It may be useful when loading pretrained embeddings for avoiding finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable.
reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both conv_layers and num_conv_layers are null, a default list will be assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
norm_params (default: null): Parameters used if norm is either batch or layer.
num_fc_layers (default: null): Number of parallel fully connected layers to use.
fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.
skip (default: false):
adapter (default: null):
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter lets you specify a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
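
The pretrained_embeddings behavior described in the list above (keep only vocabulary words, fill the rest with an average plus noise) can be sketched as follows. This is a hedged illustration, not the library's loader: the function load_glove, its noise and seed parameters, and the choice to average over the matched embeddings are all assumptions made for the example. It expects the GloVe text layout of one word followed by space-separated floats per line.

```python
import random


def load_glove(lines, vocab, noise=0.01, seed=0):
    """Build an embedding table aligned with vocab from GloVe-format text lines."""
    rng = random.Random(seed)
    wanted = set(vocab)
    found = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue
        if parts[0] in wanted:   # embeddings for words outside the vocabulary are discarded
            found[parts[0]] = [float(x) for x in parts[1:]]
    if not found:
        raise ValueError("no vocabulary word was found in the embeddings")
    dim = len(next(iter(found.values())))
    # Assumption: the average is taken over the embeddings that matched the vocabulary.
    mean = [sum(vec[d] for vec in found.values()) / len(found) for d in range(dim)]
    table = []
    for word in vocab:
        if word in found:
            table.append(found[word])
        else:
            # Unmatched words get the mean plus small noise so they differ from each other.
            table.append([m + rng.gauss(0, noise) for m in mean])
    return table


glove_lines = ["the 1.0 2.0", "cat 3.0 4.0", "zebra 9.0 9.0"]
vocab = ["<UNK>", "the", "cat"]
table = load_glove(glove_lines, vocab)
```

With a real file, the lines argument could simply be an open file handle, since iterating over it yields one line at a time.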

Stacked CNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C["1D Conv Layers\n Different Widths"];
  C --> D["Fully\n Connected\n Layers"];
  D --> ...;

The stacked cnn encoder is inspired by Xiang Zhang et al.'s Character-level Convolutional Networks for Text Classification. It first maps the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then passes the embeddings through a stack of 1d convolutional layers with different filter sizes (by default 6 layers with filter sizes 7, 7, 3, 3, 3 and 3), followed by an optional final pool and by a flatten operation. The single flattened vector is then passed through a stack of fully connected layers and returned as a b x h tensor, where h is the output size of the last fully connected layer. If you want to output the full b x s x h tensor, you can set the pool_size of all your conv_layers to null and specify reduce_output: null. If pool_size has a value different from null and reduce_output is null, the returned tensor will be of shape b x s' x h, where s' is the width of the output of the last convolutional layer.
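
To get a feel for how s' relates to s, the sketch below walks the default layer list and tracks the sequence length. It rests on stated assumptions rather than on the library's exact arithmetic: convolutions use stride 1 with "same" padding (length unchanged), and each pooling step with "same" padding uses a stride equal to its pool_size, giving ceil(length / pool_size). The function name output_sequence_length is hypothetical.

```python
import math

# Default conv_layers list as given in the parameter description of this encoder.
DEFAULT_CONV_LAYERS = [
    {"filter_size": 7, "pool_size": 3},
    {"filter_size": 7, "pool_size": 3},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": 3},
]


def output_sequence_length(s, conv_layers):
    """Estimate s' after all layers, under the assumptions stated above."""
    length = s
    for layer in conv_layers:
        # Assumed: a stride-1 "same" convolution leaves the length unchanged.
        pool_size = layer.get("pool_size")
        if pool_size is not None:
            # Assumed: pooling stride equals pool_size, with "same" padding.
            length = math.ceil(length / pool_size)
    return length


s_prime = output_sequence_length(256, DEFAULT_CONV_LAYERS)   # 256 -> 86 -> 29 -> 10
no_pooling = output_sequence_length(256, [{"filter_size": 3, "pool_size": None}] * 6)
```

With every pool_size set to null the length stays at s, which is the case where the full b x s x h tensor can be returned.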

encoder:
    type: stacked_cnn
    dropout: 0.0
    num_conv_layers: null
    embedding_size: 256
    output_size: 256
    activation: relu
    filter_size: 3
    strides: 1
    norm: null
    representation: dense
    conv_layers: null
    pool_function: max
    pool_size: null
    dilation_rate: 1
    pool_strides: null
    pool_padding: same
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: sum
    norm_params: null
    num_fc_layers: null
    fc_layers: null
    skip: false
    adapter: null
    num_filters: 256
    padding: same
    pretrained_embeddings: null

Parameters:

dropout (default: 0.0) : Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
num_conv_layers (default: null) : The number of stacked convolutional layers when conv_layers is null.
embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
output_size (default: 256) : The default output_size that will be used for each layer.
activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
strides (default: 1): Stride length of the convolution.
norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both conv_layers and num_conv_layers are null, a default list will be assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
dilation_rate (default: 1): Dilation rate to use for dilated convolution.
pool_strides (default: null): Factor to scale down.
pool_padding (default: same): Padding to use. Options: valid, same.
use_bias (default: true): Whether to use a bias vector.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory.
embeddings_trainable (default: true): If true embeddings are trained during the training process, if false embeddings are fixed. It may be useful when loading pretrained embeddings for avoiding finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable.
reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
norm_params (default: null): Parameters used if norm is either batch or layer.
num_fc_layers (default: null): Number of parallel fully connected layers to use.
fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.
skip (default: false):
adapter (default: null):
num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
padding (default: same): Padding to use. Options: valid, same.
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter lets you specify a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
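
The fallback rules for conv_layers in the list above (per-layer values override encoder-level defaults, and a built-in list is used when both conv_layers and num_conv_layers are null) can be sketched as a small merge function. The name resolve_conv_layers and its signature are hypothetical, and the handling of num_conv_layers as "that many layers built purely from defaults" is an assumption.

```python
def resolve_conv_layers(conv_layers, num_conv_layers, encoder_defaults, fallback):
    """Return one fully specified parameter dict per convolutional layer."""
    if conv_layers is None:
        if num_conv_layers is None:
            conv_layers = fallback
        else:
            # Assumed: num_conv_layers identical layers built from the defaults.
            conv_layers = [{} for _ in range(num_conv_layers)]
    # Keys present in a layer dict win; missing keys fall back to the encoder defaults.
    return [{**encoder_defaults, **layer} for layer in conv_layers]


encoder_defaults = {"filter_size": 3, "num_filters": 256, "pool_size": None, "activation": "relu"}
fallback = [
    {"filter_size": 7, "pool_size": 3},
    {"filter_size": 7, "pool_size": 3},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": 3},
]

default_stack = resolve_conv_layers(None, None, encoder_defaults, fallback)
two_layers = resolve_conv_layers(None, 2, encoder_defaults, fallback)
custom = resolve_conv_layers([{"filter_size": 5}], None, encoder_defaults, fallback)
```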

Stacked Parallel CNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  C --> D1["1D Conv\n Width 2"] --> E["Concat"];
  C --> D2["1D Conv\n Width 3"] --> E;
  C --> D3["1D Conv\n Width 4"] --> E;
  C --> D4["1D Conv\n Width 5"] --> E;
  E --> F["..."];
  F --> G1["1D Conv\n Width 2"] --> H["Concat"];
  F --> G2["1D Conv\n Width 3"] --> H;
  F --> G3["1D Conv\n Width 4"] --> H;
  F --> G4["1D Conv\n Width 5"] --> H;
  H --> I["Pool"] --> J["Fully\n Connected\n Layers"] --> K["..."];

The stacked parallel cnn encoder is a combination of the Parallel CNN and the Stacked CNN encoders, where each layer of the stack is composed of parallel convolutional layers. It first maps the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then passes the embeddings through a stack of several parallel 1d convolutional layers with different filter sizes, followed by an optional final pool and by a flatten operation. The single flattened vector is then passed through a stack of fully connected layers and returned as a b x h tensor, where h is the output size of the last fully connected layer. If you want to output the full b x s x h tensor, you can specify reduce_output: null.

encoder:
    type: stacked_parallel_cnn
    dropout: 0.0
    embedding_size: 256
    output_size: 256
    activation: relu
    filter_size: 3
    norm: null
    representation: dense
    num_stacked_layers: null
    pool_function: max
    pool_size: null
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: sum
    norm_params: null
    num_fc_layers: null
    fc_layers: null
    skip: false
    adapter: null
    stacked_layers: null
    num_filters: 256
    pretrained_embeddings: null

Parameters:

dropout (default: 0.0) : Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
output_size (default: 256) : The default output_size that will be used for each layer.
activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
num_stacked_layers (default: null): If stacked_layers is null, this is the number of elements in the stack of parallel convolutional layers.
pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
use_bias (default: true): Whether to use a bias vector.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory.
embeddings_trainable (default: true): If true embeddings are trained during the training process, if false embeddings are fixed. It may be useful when loading pretrained embeddings for avoiding finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable.
reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
norm_params (default: null): Parameters used if norm is either batch or layer.
num_fc_layers (default: null): Number of parallel fully connected layers to use.
fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.
skip (default: false):
adapter (default: null):
stacked_layers (default: null): a nested list of lists of dictionaries containing the parameters of the stack of parallel convolutional layers. The length of the list determines the number of stacked parallel convolutional layers, length of the sub-lists determines the number of parallel conv layers and the content of each dictionary determines the parameters for a specific layer.
num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter lets you specify a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
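
The nested shape of stacked_layers is easier to see written out. The sketch below builds a hypothetical value with two stacked levels of four parallel layers each (widths 2 to 5, chosen only to mirror the diagram at the top of this section) and inspects its structure. The helper describe_stack and the channel estimate are illustrative assumptions: the estimate supposes every parallel layer emits num_filters channels and that the outputs are concatenated along the channel dimension.

```python
# Outer list: one entry per stacked level. Inner list: the parallel layers of that level.
stacked_layers = [
    [{"filter_size": w} for w in (2, 3, 4, 5)],
    [{"filter_size": w} for w in (2, 3, 4, 5)],
]

encoder_config = {
    "type": "stacked_parallel_cnn",
    "stacked_layers": stacked_layers,
    "num_filters": 256,
}


def describe_stack(stacked_layers, num_filters):
    """Return (parallel layer count, assumed concatenated channels) for each level."""
    return [(len(level), len(level) * num_filters) for level in stacked_layers]


summary = describe_stack(encoder_config["stacked_layers"], encoder_config["num_filters"])
```

Expressed in a YAML config, the same value would be a list of lists of mappings under the stacked_layers key of the encoder.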

RNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C["RNN Layers"];
  C --> D["Fully\n Connected\n Layers"];
  D --> ...;

The rnn encoder first maps the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then passes the embeddings through a stack of recurrent layers (by default 1 layer), followed by a reduce operation that by default returns only the last output but can perform other reduce functions. If you want to output the full b x s x h tensor, where h is the size of the output of the last rnn layer, you can specify reduce_output: null.
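
To make the reduce step concrete, the sketch below applies a few of the reduce_output modes to a b x s x h tensor represented as nested Python lists. It is illustrative only: reduce_sequence is a hypothetical name, the attention-based modes are omitted, and last simply takes the final timestep without any padding handling, which the real encoder may treat differently.

```python
def reduce_sequence(outputs, mode):
    """Reduce a b x s x h nested list along the sequence dimension s."""
    if mode is None:
        return outputs  # keep the full b x s x h tensor
    reduced = []
    for sequence in outputs:  # each sequence has shape s x h
        columns = list(zip(*sequence))  # h tuples, each of length s
        if mode == "last":
            reduced.append(list(sequence[-1]))
        elif mode == "sum":
            reduced.append([sum(col) for col in columns])
        elif mode in ("mean", "avg"):
            reduced.append([sum(col) / len(col) for col in columns])
        elif mode == "max":
            reduced.append([max(col) for col in columns])
        elif mode == "concat":
            # Flatten s x h into a single vector of length s * h.
            reduced.append([value for step in sequence for value in step])
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return reduced


# One example (b = 1) with s = 3 timesteps and h = 2 features.
batch = [[[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]]
last_output = reduce_sequence(batch, "last")  # shape b x h
full_output = reduce_sequence(batch, None)    # shape b x s x h
```

Every mode except None and concat returns a b x h result; concat returns b x (s * h).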

encoder:
    type: rnn
    dropout: 0.0
    cell_type: rnn
    num_layers: 1
    state_size: 256
    embedding_size: 256
    output_size: 256
    norm: null
    num_fc_layers: 0
    fc_dropout: 0.0
    recurrent_dropout: 0.0
    activation: tanh
    fc_activation: relu
    recurrent_activation: sigmoid
    representation: dense
    unit_forget_bias: true
    recurrent_initializer: orthogonal
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: last
    norm_params: null
    fc_layers: null
    skip: false
    adapter: null
    bidirectional: false
    pretrained_embeddings: null

Parameters:

dropout (default: 0.0) : Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
cell_type (default: rnn) : The type of recurrent cell to use. For reference about the differences between the cells, please refer to torch.nn Recurrent Layers. Options: rnn, lstm, gru.
num_layers (default: 1) : The number of stacked recurrent layers.
state_size (default: 256) : The size of the state of the rnn.
embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
output_size (default: 256) : The default output_size that will be used for each layer.
norm (default: null) : The default norm that will be used for each layer. Options: batch, layer, ghost, null.
num_fc_layers (default: 0) : Number of stacked fully connected layers to use. Adding layers increases the capacity of the model, enabling it to learn more complex feature interactions.
fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
recurrent_dropout (default: 0.0): The dropout rate for the recurrent state.
activation (default: tanh): The default activation function. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
recurrent_activation (default: sigmoid): The activation function to use in the recurrent step. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
unit_forget_bias (default: true): If true, add 1 to the bias of the forget gate at initialization.
recurrent_initializer (default: orthogonal): The initializer for recurrent matrix weights. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
use_bias (default: true): Whether to use a bias vector.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory.
embeddings_trainable (default: true): If true, embeddings are updated during training; if false, they stay fixed. Freezing is useful when loading pretrained embeddings that you do not want to fine-tune. This parameter only has an effect when representation is dense; sparse one-hot encodings are not trainable.
reduce_output (default: last): How to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
norm_params (default: null): Default parameters passed to the norm module.
fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
skip (default: false):
adapter (default: null):
bidirectional (default: false): If true, two recurrent networks will perform encoding in the forward and backward direction and their outputs will be concatenated.
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter lets you specify a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter only has an effect if representation is dense.
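
The fallback rule stated for fc_layers (missing per-layer values are taken from the standalone encoder parameters) can be sketched as follows. This is a rough illustration under assumptions, not Ludwig's code: resolve_fc_layers is a hypothetical helper, and the mapping of the encoder-level fc_activation and fc_dropout onto the per-layer activation and dropout keys is assumed from their descriptions.

```python
FC_LAYER_KEYS = (
    "activation", "dropout", "norm", "norm_params",
    "output_size", "use_bias", "bias_initializer", "weights_initializer",
)


def resolve_fc_layers(encoder_config):
    """Hypothetical helper: expand fc_layers using encoder-level defaults."""
    # Assumed mapping: fc_activation / fc_dropout supply the per-layer
    # activation / dropout; the other keys share their encoder-level name.
    defaults = {
        "activation": encoder_config.get("fc_activation", "relu"),
        "dropout": encoder_config.get("fc_dropout", 0.0),
        "norm": encoder_config.get("norm"),
        "norm_params": encoder_config.get("norm_params"),
        "output_size": encoder_config.get("output_size", 256),
        "use_bias": encoder_config.get("use_bias", True),
        "bias_initializer": encoder_config.get("bias_initializer", "zeros"),
        "weights_initializer": encoder_config.get("weights_initializer", "xavier_uniform"),
    }
    layers = encoder_config.get("fc_layers")
    if layers is None:
        # Assumption: without an explicit list, num_fc_layers identical layers are built.
        layers = [{} for _ in range(encoder_config.get("num_fc_layers") or 0)]
    resolved = []
    for layer in layers:
        unknown = set(layer) - set(FC_LAYER_KEYS)
        if unknown:
            raise ValueError(f"unknown fc layer keys: {sorted(unknown)}")
        resolved.append({**defaults, **layer})  # per-layer values win
    return resolved


config = {
    "type": "rnn",
    "output_size": 128,
    "fc_layers": [{"output_size": 64}, {"dropout": 0.2}],
}
resolved = resolve_fc_layers(config)
```

In this example the first layer keeps its own output_size of 64, while the second inherits 128 from the encoder and overrides only dropout.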

CNN RNN Encoder¶

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C1["CNN Layers"];
  C1 --> C2["RNN Layers"];
  C2 --> D["Fully\n Connected\n Layers"];
  D --> ...;

The cnnrnn encoder first maps the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then passes the embeddings through a stack of convolutional layers (by default 2), followed by a stack of recurrent layers (by default 1), followed by a reduce operation that by default returns only the last output but can perform other reduce functions. If you want to output the full b x s x h tensor, where h is the size of the output of the last rnn layer, you can specify reduce_output: null.

encoder:
    type: cnnrnn
    dropout: 0.0
    conv_dropout: 0.0
    cell_type: rnn
    num_conv_layers: null
    state_size: 256
    embedding_size: 256
    output_size: 256
    norm: null
    num_fc_layers: 0
    fc_dropout: 0.0
    recurrent_dropout: 0.0
    activation: tanh
    filter_size: 5
    strides: 1
    fc_activation: relu
    recurrent_activation: sigmoid
    conv_activation: relu
    representation: dense
    conv_layers: null
    pool_function: max
    pool_size: null
    dilation_rate: 1
    pool_strides: null
    pool_padding: same
    unit_forget_bias: true
    recurrent_initializer: orthogonal
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: last
    norm_params: null
    fc_layers: null
    skip: false
    adapter: null
    num_filters: 256
    padding: same
    num_rec_layers: 1
    bidirectional: false
    pretrained_embeddings: null

Parameters:

dropout (default: 0.0) : Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
conv_dropout (default: 0.0) : The dropout rate for the convolutional layers.
cell_type (default: rnn) : The type of recurrent cell to use. For reference about the differences between the cells, please refer to torch.nn Recurrent Layers. Options: rnn, lstm, gru.
num_conv_layers (default: null) : The number of stacked convolutional layers when conv_layers is null.
state_size (default: 256) : The size of the state of the rnn.
embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
output_size (default: 256) : The default output_size that will be used for each layer.
norm (default: null) : The default norm that will be used for each layer. Options: batch, layer, ghost, null.
num_fc_layers (default: 0) : Number of stacked fully connected layers to use. Adding layers increases the capacity of the model, enabling it to learn more complex feature interactions.
fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
recurrent_dropout (default: 0.0): The dropout rate for the recurrent state.
activation (default: tanh): The default activation function to use. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
filter_size (default: 5): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
strides (default: 1): Stride length of the convolution.
fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
recurrent_activation (default: sigmoid): The activation function to use in the recurrent step. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
conv_activation (default: relu): The default activation function that will be used for each convolutional layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both conv_layers and num_conv_layers are null, a default list will be assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
dilation_rate (default: 1): Dilation rate to use for dilated convolution.
pool_strides (default: null): Factor to scale down.
pool_padding (default: same): Padding to use. Options: valid, same.
unit_forget_bias (default: true): If true, add 1 to the bias of the forget gate at initialization.
recurrent_initializer (default: orthogonal): The initializer for recurrent matrix weights. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
use_bias (default: true): Whether to use a bias vector.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory.
embeddings_trainable (default: true): If true, embeddings are updated during training; if false, they stay fixed. Freezing is useful when loading pretrained embeddings that you do not want to fine-tune. This parameter only has an effect when representation is dense; sparse one-hot encodings are not trainable.
reduce_output (default: last): How to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
norm_params (default: null): Default parameters passed to the norm module.
fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
skip (default: false):
adapter (default: null):
num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
padding (default: same): Padding to use. Options: valid, same.
num_rec_layers (default: 1): The number of stacked recurrent layers.
bidirectional (default: false): If true, two recurrent networks will perform encoding in the forward and backward direction and their outputs will be concatenated.
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter lets you specify a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter only has an effect if representation is dense.
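
The precedence between conv_layers and num_conv_layers, as worded in their entries above, can be sketched like this. It is a simplified illustration: select_conv_layers is a hypothetical helper, only filter_size and pool_size are resolved, and the six-layer fallback is copied from the conv_layers description rather than checked against the library.

```python
# Fallback list quoted in the conv_layers entry (used when both settings are null).
DEFAULT_CONV_LAYERS = [
    {"filter_size": 7, "pool_size": 3},
    {"filter_size": 7, "pool_size": 3},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": 3},
]


def select_conv_layers(conv_layers=None, num_conv_layers=None, filter_size=5, pool_size=None):
    """Hypothetical helper: pick the layer list, then fill encoder-level defaults."""
    if conv_layers is not None:
        layers = conv_layers  # an explicit list always wins
    elif num_conv_layers is not None:
        layers = [{} for _ in range(num_conv_layers)]  # identical default layers
    else:
        layers = DEFAULT_CONV_LAYERS
    defaults = {"filter_size": filter_size, "pool_size": pool_size}
    return [{**defaults, **layer} for layer in layers]


two_layers = select_conv_layers(num_conv_layers=2)
fallback = select_conv_layers()
```

With num_conv_layers set to 2, both layers take the encoder-level filter_size of 5 and no pooling; with neither setting given, the six-layer list is used.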

Transformer Encoder¶

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C["Transformer\n Blocks"];
  C --> D["Fully\n Connected\n Layers"];
  D --> ...;

The transformer encoder implements a stack of transformer blocks, replicating the architecture introduced in the Attention Is All You Need paper, and adds an optional stack of fully connected layers at the end.

encoder:
    type: transformer
    dropout: 0.1
    num_layers: 1
    embedding_size: 256
    output_size: 256
    norm: null
    num_fc_layers: 0
    fc_dropout: 0.0
    hidden_size: 256
    transformer_output_size: 256
    fc_activation: relu
    representation: dense
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: last
    norm_params: null
    fc_layers: null
    skip: false
    adapter: null
    num_heads: 8
    pretrained_embeddings: null
    use_rope: false

Parameters:

dropout (default: 0.1) : The dropout rate for the transformer block. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
num_layers (default: 1) : The number of transformer layers.
embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
output_size (default: 256) : The default output_size that will be used for each layer.
norm (default: null) : The default norm that will be used for each layer. Options: batch, layer, ghost, null.
num_fc_layers (default: 0) : Number of stacked fully connected layers to use. Adding layers increases the capacity of the model, enabling it to learn more complex feature interactions.
fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
hidden_size (default: 256): The size of the hidden representation within the transformer block. It is usually the same as the embedding_size, but if the two values are different, a projection layer will be added before the first transformer block.
transformer_output_size (default: 256): Size of the fully connected layer after self attention in the transformer block. This is usually the same as hidden_size and embedding_size.
fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
use_bias (default: true): Whether to use a bias vector.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory.
embeddings_trainable (default: true): If true, embeddings are updated during training; if false, they stay fixed. Freezing is useful when loading pretrained embeddings that you do not want to fine-tune. This parameter only has an effect when representation is dense; sparse one-hot encodings are not trainable.
reduce_output (default: last): How to reduce the output tensor along the sequence length dimension s if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
norm_params (default: null): Default parameters passed to the norm module.
fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
skip (default: false):
adapter (default: null):
num_heads (default: 8): Number of attention heads in each transformer block.
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter lets you specify a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter only has an effect if representation is dense.
use_rope (default: false): If True, use Rotary Position Embeddings (RoPE) instead of absolute positional embeddings. RoPE encodes position by rotating query and key vectors, providing better length generalization and relative position awareness. Used by modern LLMs like LLaMA and Mistral.
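
The rotation that use_rope refers to can be illustrated with a small pure-Python sketch. It is not the encoder's implementation: rope_rotate is a hypothetical name, the base of 10000 and the pairing of adjacent dimensions are one common convention, and real implementations operate on batched tensors.

```python
import math


def rope_rotate(vector, position, base=10000.0):
    """Rotate consecutive pairs of `vector` by position-dependent angles."""
    dim = len(vector)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even dimension")
    rotated = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]

# Both pairs of positions are 2 apart, so the attention scores should agree.
score_near = dot(rope_rotate(query, 3), rope_rotate(key, 1))
score_far = dot(rope_rotate(query, 10), rope_rotate(key, 8))
```

Because each pair is only rotated, vector norms are unchanged, and the query-key dot product depends on the offset between the two positions rather than on their absolute values. That is the relative position awareness mentioned above.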

ModernBERT Encoder¶

The ModernBERT encoder (Warner et al., 2024) is the first major architectural upgrade to BERT, incorporating Flash Attention 2, Rotary Position Embeddings (RoPE), GeGLU activations, and alternating local/global attention patterns. It supports context lengths up to 8192 tokens (compared to BERT's 512), making it suitable for longer documents without needing a specialized long-context model like Longformer.

ModernBERT matches or exceeds RoBERTa and DeBERTa on most NLU benchmarks while being significantly faster due to Flash Attention and unpadding optimizations. It is the recommended default for new projects.

Default pretrained model: answerdotai/ModernBERT-base

encoder:
    type: modernbert
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: answerdotai/ModernBERT-base
    hidden_dropout_prob: 0.0
    max_position_embeddings: 8192
    reduce_output: cls_pooled
    hidden_size: 768
    num_hidden_layers: 22
    num_attention_heads: 12
    intermediate_size: 1152
    hidden_act: gelu
    initializer_range: 0.02
    layer_norm_eps: 1.0e-05
    pad_token_id: 50283
    pretrained_kwargs: null
    skip: false
    adapter: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: answerdotai/ModernBERT-base): Name or path of the pretrained model.
hidden_dropout_prob (default: 0.0): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
max_position_embeddings (default: 8192): The maximum sequence length that this model might ever be used with.
reduce_output (default: cls_pooled): The method used to reduce a sequence of tensors down to a single tensor.
hidden_size (default: 768): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (default: 22): Number of hidden layers in the Transformer encoder.
num_attention_heads (default: 12): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (default: 1152): Dimensionality of the 'intermediate' (often named feed-forward) layer in the Transformer encoder.
hidden_act (default: gelu): The non-linear activation function (function or string) in the encoder and pooler. Options: gelu, relu, silu, gelu_new.
initializer_range (default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (default: 1e-05): The epsilon used by the layer normalization layers.
pad_token_id (default: 50283): The ID of the token to use as padding.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.
skip (default: false):
adapter (default: null):

TF-IDF Encoder¶

The tf_idf encoder computes Term Frequency-Inverse Document Frequency features for text input. This is a non-neural baseline that produces sparse, interpretable features. It is useful for very small datasets where pretrained transformers may overfit, or when you need a simple baseline for comparison.

The encoder supports n-gram features via ngram_range and document frequency filtering via min_df and max_df parameters.

encoder:
    type: tf_idf
    skip: false
    adapter: null
    ngram_range:
    - 1
    - 1
    max_df: 1.0
    min_df: 1

Parameters:

skip (default: false):
adapter (default: null):
ngram_range (default: null): The range of n-gram sizes to use for tokenization, as a (min_n, max_n) tuple. For example, (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, (1, 3) means unigrams, bigrams, and trigrams. Applied during preprocessing.
max_df (default: 1.0): Maximum document frequency threshold for pruning. Terms that appear in more than this fraction of documents are ignored. Set to a value < 1.0 to remove corpus-specific stop words. Applied during preprocessing.
min_df (default: 1): Minimum document frequency threshold for pruning. Terms that appear in fewer than this many documents are ignored. Useful for removing very rare terms. Applied during preprocessing.
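
As a rough picture of how ngram_range, min_df and max_df interact, the sketch below computes TF-IDF features in pure Python. It is an illustration under assumptions, not the encoder's code: whitespace tokenization, raw term counts, and the idf formula log(N / df) + 1 are choices made here, and the tf_idf and ngrams helpers are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, ngram_range):
    """Return all n-grams with min_n <= n <= max_n as space-joined strings."""
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tf_idf(documents, ngram_range=(1, 1), min_df=1, max_df=1.0):
    docs_terms = [ngrams(doc.split(), ngram_range) for doc in documents]
    n_docs = len(docs_terms)
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in docs_terms for term in set(terms))
    vocab = sorted(
        term for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df
    )
    # Assumed idf variant; other smoothing schemes are equally plausible.
    idf = {term: math.log(n_docs / df[term]) + 1.0 for term in vocab}
    rows = []
    for terms in docs_terms:
        tf = Counter(terms)
        rows.append([tf[term] * idf[term] for term in vocab])
    return vocab, rows


docs = ["the cat sat", "the dog sat", "the bird flew"]
vocab, rows = tf_idf(docs, min_df=2, max_df=0.9)
bigram_vocab, _ = tf_idf(docs, ngram_range=(1, 2))
```

With these settings "the" is dropped because it appears in every document (above max_df), and terms seen in a single document are dropped by min_df, leaving only "sat".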

Mamba-2 Encoder¶

Mamba-2 (Dao & Gu, 2024) is a linear-time selective State Space Model (SSM) encoder. It processes sequences in O(L) time instead of O(L²) for attention, making it efficient for long texts. Ludwig's Mamba-2 encoder is a pure-PyTorch implementation requiring no special CUDA kernels.
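
The linear-time claim comes from the recurrent form of a state space model: each step updates a running state from the previous one, so a sequence of length L needs a single pass. The toy below shows that pattern with a scalar state. It is a deliberately simplified sketch and not the Mamba-2 algorithm; the name selective_scan and the scalar decay and gate values are assumptions for illustration.

```python
def selective_scan(inputs, decays, gates):
    """Toy scalar recurrence h_t = a_t * h_(t-1) + b_t * x_t, computed in one pass."""
    state = 0.0
    outputs = []
    for x, a, b in zip(inputs, decays, gates):
        # a and b may differ per step, which is what makes the scan input-dependent.
        state = a * state + b * x
        outputs.append(state)
    return outputs


states = selective_scan([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [1.0, 1.0, 1.0])
```

Each output depends on all earlier inputs through the carried state, yet the loop body runs exactly once per token, in contrast with attention, which compares every pair of positions.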

When to use: Long text sequences (>1k tokens) where Transformer attention becomes a bottleneck. Mamba-2 is 2-4× faster than equivalent Transformer encoders on sequences >512 tokens.

input_features:
  - name: article_text
    type: text
    encoder:
      type: mamba2
      d_model: 256        # Hidden width of each Mamba-2 block
      n_layers: 4         # Number of stacked Mamba-2 blocks
      num_heads: 8        # Number of SSD heads (d_model * expand_factor must be divisible by num_heads)
      d_conv: 4           # Width of the depthwise 1D convolution
      expand_factor: 2    # Inner expansion factor
      output_size: 256    # Output feature width
      reduce_output: mean # How to reduce the sequence dimension
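
The divisibility constraint noted in the num_heads comment can be checked before training. A minimal sketch, assuming the inner width is d_model * expand_factor as the comment implies; mamba2_head_dim is a hypothetical helper, not part of Ludwig.

```python
def mamba2_head_dim(d_model=256, expand_factor=2, num_heads=8):
    """Hypothetical check: return the per-head width or raise if the config is invalid."""
    d_inner = d_model * expand_factor
    if d_inner % num_heads != 0:
        raise ValueError(
            f"d_model * expand_factor = {d_inner} is not divisible by num_heads = {num_heads}"
        )
    return d_inner // num_heads


head_dim = mamba2_head_dim()  # values from the example config above
```

With the values in the example config, the inner width is 512 and each of the 8 heads gets 64 dimensions; a setting such as num_heads: 7 would be rejected.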

encoder:
    type: mamba2
    dropout: 0.1
    embedding_size: 256
    representation: dense
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: mean
    skip: false
    adapter: null
    pretrained_embeddings: null
    should_embed: true
    d_model: 256
    n_layers: 4
    num_heads: 8
    d_conv: 4
    expand_factor: 2
    output_size: 256

Parameters:

dropout (default: 0.1) : Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory.
embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are kept fixed. Fixing them may be useful when loading pretrained embeddings, to avoid fine-tuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable.
reduce_output (default: mean): How to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
skip (default: false):
adapter (default: null):
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter lets you specify a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
should_embed (default: true): If True the input sequence is expected to be made of integers and will be mapped into embeddings.
d_model (default: 256): Hidden width of each Mamba-2 block.
n_layers (default: 4): Number of stacked Mamba-2 blocks.
num_heads (default: 8): Number of SSD heads. d_model * expand_factor must be divisible by num_heads.
d_conv (default: 4): Width of the depthwise 1D convolution inside each block.
expand_factor (default: 2): Inner expansion factor for each block.
output_size (default: 256): Output feature width emitted by the encoder.
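The divisibility constraint on num_heads can be checked before training. The sketch below assumes the inner width is d_model * expand_factor and that it is split evenly across heads; the function name mamba2_head_dim and the variable d_inner are hypothetical and are not part of Ludwig's API.

```python
def mamba2_head_dim(d_model=256, expand_factor=2, num_heads=8):
    """Return the per-head width, or raise if the constraint is violated."""
    d_inner = d_model * expand_factor  # assumed inner width of each block
    if d_inner % num_heads != 0:
        raise ValueError(
            f"d_model * expand_factor = {d_inner} is not divisible by "
            f"num_heads = {num_heads}"
        )
    return d_inner // num_heads


print(mamba2_head_dim())  # 64 with the defaults (256 * 2 / 8)
```

With the defaults the check passes; a value such as num_heads: 6 would fail because 512 is not a multiple of 6.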

Jamba Encoder¶

Jamba (Lieber et al., 2024) is a hybrid encoder that interleaves Mamba-2 SSM blocks with Transformer attention blocks. Every attention_every_k-th layer is a Transformer attention layer; the rest are Mamba-2 SSM layers. This gives better stability and accuracy than pure Mamba while retaining much of the efficiency gain.

When to use: Long text where you want some of the efficiency of Mamba-2 but better accuracy than pure SSM. The 1:3 default attention:SSM ratio is a good starting point.
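The interleaving rule can be sketched as follows. This assumes a 1-indexed reading of "every attention_every_k-th layer", so that with attention_every_k = 4 the 4th and 8th blocks are attention; the exact position of the attention block inside each group in Ludwig's implementation is an assumption here, and jamba_layer_types is a hypothetical helper.

```python
def jamba_layer_types(n_layers=8, attention_every_k=4):
    """List the block type at each depth under the assumed 1-indexed rule."""
    return [
        "attention" if (i + 1) % attention_every_k == 0 else "ssm"
        for i in range(n_layers)
    ]


layout = jamba_layer_types()
print(layout)
# ['ssm', 'ssm', 'ssm', 'attention', 'ssm', 'ssm', 'ssm', 'attention']
print(layout.count("attention"), layout.count("ssm"))  # 2 6, i.e. a 1:3 ratio
```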

input_features:
  - name: article_text
    type: text
    encoder:
      type: jamba
      d_model: 256
      n_layers: 8          # Total layers (SSM + attention combined)
      attention_every_k: 4 # Every 4th layer is attention (1:3 ratio)
      num_heads: 8
      ffn_size: 1024       # Feed-forward size inside attention blocks
      d_conv: 4
      expand_factor: 2
      output_size: 256
      reduce_output: mean

encoder:
    type: jamba
    dropout: 0.1
    embedding_size: 256
    representation: dense
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: mean
    skip: false
    adapter: null
    pretrained_embeddings: null
    should_embed: true
    d_model: 256
    n_layers: 8
    attention_every_k: 4
    num_heads: 8
    ffn_size: 1024
    d_conv: 4
    expand_factor: 2
    output_size: 256

Parameters:

dropout (default: 0.1) : Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory.
embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are kept fixed. Fixing them may be useful when loading pretrained embeddings, to avoid fine-tuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable.
reduce_output (default: mean): How to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
skip (default: false):
adapter (default: null):
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter lets you specify a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
should_embed (default: true): If True the input sequence is expected to be made of integers and will be mapped into embeddings.
d_model (default: 256): Hidden width of every block — SSM and attention share the same d_model.
n_layers (default: 8): Total number of stacked blocks (SSM + attention combined).
attention_every_k (default: 4): Every attention_every_k-th block is attention, the remainder are SSM. Default 4 gives a 1:3 attention:SSM ratio matching the Jamba paper.
num_heads (default: 8): Number of attention heads (and SSD heads, shared).
ffn_size (default: 1024): Feed-forward width inside each attention block.
d_conv (default: 4): Width of the depthwise 1D convolution inside each SSM block.
expand_factor (default: 2): Inner expansion factor inside each SSM block.
output_size (default: 256): Output feature width emitted by the encoder.
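The rule for the actual embedding width described under embedding_size can be written out directly. This is a sketch of the documented rule only, not Ludwig's code; effective_embedding_size is a hypothetical name.

```python
def effective_embedding_size(vocabulary_size, embedding_size=256,
                             representation="dense"):
    """Apply the documented rule for the actual embedding width."""
    if representation == "dense":
        # Dense embeddings never exceed the vocabulary size.
        return min(vocabulary_size, embedding_size)
    if representation == "sparse":
        # One-hot encodings have exactly one dimension per vocabulary entry.
        return vocabulary_size
    raise ValueError(f"unknown representation: {representation!r}")


print(effective_embedding_size(20000))                           # 256
print(effective_embedding_size(100))                             # 100
print(effective_embedding_size(20000, representation="sparse"))  # 20000
```

The third call illustrates why sparse representations become expensive for large vocabularies.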

Huggingface encoders¶

All huggingface-based text encoders are configured with the following parameters:

pretrained_model_name_or_path (default is the huggingface default model path for the specified encoder, e.g. bert-base-uncased for BERT). This can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.
reduce_output (default cls_pooled): defines how to reduce the output tensor along the sequence length dimension if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (which does not reduce and returns the full tensor).
trainable (default false): if true the weights of the encoder will be trained, otherwise they will be kept frozen.
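The simpler reduce_output modes can be illustrated on a single example in pure Python. This is a sketch under simplifying assumptions: the batch dimension is dropped, padding is ignored, and cls_pooled is omitted because it depends on the pretrained model's own pooling head. The function reduce_sequence is hypothetical and is not Ludwig's reducer.

```python
def reduce_sequence(hidden, mode):
    """Reduce a list of L token vectors (each a list of H floats)."""
    if mode is None:
        return hidden                                  # keep the full [L, H] output
    if mode == "sum":
        return [sum(col) for col in zip(*hidden)]      # sum over tokens, per feature
    if mode in ("mean", "avg"):
        return [sum(col) / len(hidden) for col in zip(*hidden)]
    if mode == "max":
        return [max(col) for col in zip(*hidden)]
    if mode == "last":
        return hidden[-1]                              # last token vector
    if mode == "concat":
        return [v for token in hidden for v in token]  # flatten to L * H values
    raise ValueError(f"unsupported mode: {mode!r}")


hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]  # L = 3 tokens, H = 2 features
print(reduce_sequence(hidden, "sum"))     # [9.0, 6.0]
print(reduce_sequence(hidden, "mean"))    # [3.0, 2.0]
print(reduce_sequence(hidden, "max"))     # [5.0, 4.0]
print(reduce_sequence(hidden, "concat"))  # [1.0, 2.0, 3.0, 4.0, 5.0, 0.0]
```

Every mode except concat and null yields a vector of width H regardless of sequence length; concat yields L * H values, so it requires a fixed sequence length.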

Note

Any hyperparameter of any huggingface encoder can be overridden. Check the huggingface documentation for which parameters are used for which models.

name: text_column_name
type: text
encoder:
    type: bert
    trainable: true
    num_attention_heads: 16 # Instead of 12
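Conceptually, an override replaces one default and leaves the rest untouched. The sketch below models that with a plain dictionary merge and adds the sanity check that BERT-style models require, namely that hidden_size is a multiple of num_attention_heads. It does not reproduce Ludwig's actual config handling; BERT_DEFAULTS (a small subset of the defaults listed for the bert encoder) and apply_overrides are illustrative names.

```python
BERT_DEFAULTS = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
}


def apply_overrides(defaults, overrides):
    """Merge user overrides over defaults and validate the head count."""
    merged = {**defaults, **overrides}  # keys in overrides win
    if merged["hidden_size"] % merged["num_attention_heads"] != 0:
        raise ValueError(
            "hidden_size must be a multiple of num_attention_heads"
        )
    return merged


config = apply_overrides(BERT_DEFAULTS, {"num_attention_heads": 16})
print(config["num_attention_heads"])                          # 16
print(config["hidden_size"] // config["num_attention_heads"])  # 48 per head
```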

AutoTransformer¶

The auto_transformer encoder automatically instantiates the model architecture for the specified pretrained_model_name_or_path. Unlike the other HF encoders, auto_transformer does not provide a default value for pretrained_model_name_or_path; this is its only mandatory parameter. See the Hugging Face AutoModels documentation for more details.

encoder:
    type: auto_transformer
    trainable: false
    pretrained_model_name_or_path: bert-base-uncased
    reduce_output: sum
    pretrained_kwargs: null
    skip: false
    adapter: null
    use_pretrained: true

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
pretrained_model_name_or_path (default: null) : Name or path of the pretrained model.
reduce_output (default: sum): The method used to reduce a sequence of tensors down to a single tensor. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.
skip (default: false):
adapter (default: null):
use_pretrained (default: true): Whether to use the pretrained weights for the model. Always True for AutoTransformers.

ALBERT¶

The albert encoder loads a pretrained ALBERT (default albert-base-v2) model using the Hugging Face transformers package. ALBERT is similar to BERT, with significantly lower memory usage and somewhat faster training time.

encoder:
    type: albert
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: albert-base-v2
    reduce_output: cls_pooled
    embedding_size: 128
    hidden_size: 768
    num_hidden_layers: 12
    num_hidden_groups: 1
    num_attention_heads: 12
    intermediate_size: 3072
    inner_group_num: 1
    hidden_act: gelu_new
    hidden_dropout_prob: 0.0
    attention_probs_dropout_prob: 0.0
    max_position_embeddings: 512
    type_vocab_size: 2
    initializer_range: 0.02
    layer_norm_eps: 1.0e-12
    classifier_dropout_prob: 0.1
    position_embedding_type: absolute
    pad_token_id: 0
    bos_token_id: 2
    eos_token_id: 3
    pretrained_kwargs: null
    skip: false
    adapter: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: albert-base-v2): Name or path of the pretrained model.
reduce_output (default: cls_pooled): The method used to reduce a sequence of tensors down to a single tensor.
embedding_size (default: 128): Dimensionality of vocabulary embeddings.
hidden_size (default: 768): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (default: 12): Number of hidden layers in the Transformer encoder.
num_hidden_groups (default: 1): Number of groups for the hidden layers, parameters in the same group are shared.
num_attention_heads (default: 12): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (default: 3072): The dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
inner_group_num (default: 1): The number of inner repetitions of attention and ffn.
hidden_act (default: gelu_new): The non-linear activation function (function or string) in the encoder and pooler. Options: gelu, relu, silu, gelu_new.
hidden_dropout_prob (default: 0.0): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (default: 0.0): The dropout ratio for the attention probabilities.
max_position_embeddings (default: 512): The maximum sequence length that this model might ever be used with. Typically set this to something large (e.g., 512 or 1024 or 2048).
type_vocab_size (default: 2): The vocabulary size of the token_type_ids passed when calling AlbertModel or TFAlbertModel.
initializer_range (default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (default: 1e-12): The epsilon used by the layer normalization layers.
classifier_dropout_prob (default: 0.1): The dropout ratio for attached classifiers.
position_embedding_type (default: absolute): Options: absolute, relative_key, relative_key_query.
pad_token_id (default: 0): The ID of the token to use as padding.
bos_token_id (default: 2): The beginning of sequence token ID.
eos_token_id (default: 3): The end of sequence token ID.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.
skip (default: false):
adapter (default: null):
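The effect of num_hidden_groups on parameter sharing can be sketched by mapping each layer index to the group whose weights it reuses. This assumes contiguous, equally sized groups; albert_group_for_layer is a hypothetical helper, not a function from the transformers package.

```python
def albert_group_for_layer(layer_idx, num_hidden_layers=12, num_hidden_groups=1):
    """Return the index of the parameter group used by a given layer."""
    # Integer arithmetic: layers are split into equal contiguous groups.
    return layer_idx * num_hidden_groups // num_hidden_layers


# Default: one group, so all 12 layers share a single set of weights.
print([albert_group_for_layer(i) for i in range(12)])
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Three groups: layers 0-3, 4-7 and 8-11 each share their own weights.
print([albert_group_for_layer(i, num_hidden_groups=3) for i in range(12)])
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

With the default of one group, depth increases compute but not the number of distinct layer parameters, which is where ALBERT's memory savings come from.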

BERT¶

The bert encoder loads a pretrained BERT (default bert-base-uncased) model using the Hugging Face transformers package. BERT is a bidirectional transformer pretrained using a combination of masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.

encoder:
    type: bert
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: bert-base-uncased
    hidden_dropout_prob: 0.1
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 512
    classifier_dropout: null
    reduce_output: cls_pooled
    hidden_size: 768
    num_hidden_layers: 12
    num_attention_heads: 12
    intermediate_size: 3072
    hidden_act: gelu
    type_vocab_size: 2
    initializer_range: 0.02
    layer_norm_eps: 1.0e-12
    pad_token_id: 0
    gradient_checkpointing: false
    position_embedding_type: absolute
    pretrained_kwargs: null
    skip: false
    adapter: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: bert-base-uncased): Name or path of the pretrained model.
hidden_dropout_prob (default: 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (default: 0.1): The dropout ratio for the attention probabilities.
max_position_embeddings (default: 512): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
classifier_dropout (default: null): The dropout ratio for the classification head.
reduce_output (default: cls_pooled): The method used to reduce a sequence of tensors down to a single tensor.
hidden_size (default: 768): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (default: 12): Number of hidden layers in the Transformer encoder.
num_attention_heads (default: 12): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (default: 3072): Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
hidden_act (default: gelu): The non-linear activation function (function or string) in the encoder and pooler. Options: gelu, relu, silu, gelu_new.
type_vocab_size (default: 2): The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel.
initializer_range (default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (default: 1e-12): The epsilon used by the layer normalization layers.
pad_token_id (default: 0): The ID of the token to use as padding.
gradient_checkpointing (default: false): Whether to use gradient checkpointing.
position_embedding_type (default: absolute): Type of position embedding. Options: absolute, relative_key, relative_key_query.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.
skip (default: false):
adapter (default: null):

CamemBERT¶

The camembert encoder loads a pretrained CamemBERT (default camembert-base) model using the Hugging Face transformers package. CamemBERT is pre-trained on a large French language web-crawled text corpus.

encoder:
    type: camembert
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: camembert-base
    hidden_dropout_prob: 0.1
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 514
    classifier_dropout: null
    reduce_output: sum
    hidden_size: 768
    hidden_act: gelu
    initializer_range: 0.02
    skip: false
    adapter: null
    num_hidden_layers: 12
    num_attention_heads: 12
    intermediate_size: 3072
    type_vocab_size: 1
    layer_norm_eps: 1.0e-05
    pad_token_id: 1
    gradient_checkpointing: false
    position_embedding_type: absolute
    pretrained_kwargs: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: camembert-base): Name or path of the pretrained model.
hidden_dropout_prob (default: 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (default: 0.1): The dropout ratio for the attention probabilities.
max_position_embeddings (default: 514): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
classifier_dropout (default: null): The dropout ratio for the classification head.
reduce_output (default: sum): The method used to reduce a sequence of tensors down to a single tensor.
hidden_size (default: 768): Dimensionality of the encoder layers and the pooler layer.
hidden_act (default: gelu): The non-linear activation function (function or string) in the encoder and pooler. Options: gelu, relu, silu, gelu_new.
initializer_range (default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
skip (default: false):
adapter (default: null):
num_hidden_layers (default: 12): Number of hidden layers in the Transformer encoder.
num_attention_heads (default: 12): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (default: 3072): Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
type_vocab_size (default: 1): The vocabulary size of the token_type_ids passed when calling CamembertModel or TFCamembertModel.
layer_norm_eps (default: 1e-05): The epsilon used by the layer normalization layers.
pad_token_id (default: 1): The ID of the token to use as padding.
gradient_checkpointing (default: false): Whether to use gradient checkpointing.
position_embedding_type (default: absolute): Type of position embedding. Options: absolute, relative_key, relative_key_query.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.

DeBERTa¶

The DeBERTa encoder improves on the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With those two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB of training data. In DeBERTa V3, the authors further improved the efficiency of DeBERTa using ELECTRA-style pre-training with Gradient-Disentangled Embedding Sharing. Compared to DeBERTa, the V3 version significantly improves model performance on downstream tasks.

encoder:
    type: deberta
    pretrained_model_name_or_path: tasksource/deberta-base-long-nli
    vocab_size: null
    hidden_size: 1536
    num_hidden_layers: 24
    num_attention_heads: 24
    intermediate_size: 6144
    hidden_act: gelu
    hidden_dropout_prob: 0.1
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 512
    type_vocab_size: 0
    initializer_range: 0.02
    layer_norm_eps: 1.0e-12
    relative_attention: true
    max_relative_positions: -1
    pad_token_id: 0
    position_biased_input: false
    pos_att_type:
    - p2c
    - c2p
    pooler_hidden_size: 1536
    pooler_dropout: 0
    pooler_hidden_act: gelu
    position_buckets: 256
    share_att_key: true
    norm_rel_ebd: layer_norm
    skip: false
    adapter: null
    trainable: false
    use_pretrained: true
    reduce_output: sum
    pretrained_kwargs: null
    max_sequence_length: null
    saved_weights_in_checkpoint: false

Parameters:

pretrained_model_name_or_path (default: tasksource/deberta-base-long-nli): Name or path of the pretrained model.
vocab_size (default: null):
hidden_size (default: 1536):
num_hidden_layers (default: 24):
num_attention_heads (default: 24):
intermediate_size (default: 6144):
hidden_act (default: gelu):
hidden_dropout_prob (default: 0.1):
attention_probs_dropout_prob (default: 0.1):
max_position_embeddings (default: 512):
type_vocab_size (default: 0):
initializer_range (default: 0.02):
layer_norm_eps (default: 1e-12):
relative_attention (default: true):
max_relative_positions (default: -1):
pad_token_id (default: 0):
position_biased_input (default: false):
pos_att_type (default: null): The type of relative position attention. It can be any combination of p2c and c2p, e.g. ['p2c'], ['c2p'], or ['p2c', 'c2p'].
pooler_hidden_size (default: 1536):
pooler_dropout (default: 0):
pooler_hidden_act (default: gelu):
position_buckets (default: 256):
share_att_key (default: true):
norm_rel_ebd (default: layer_norm):
skip (default: false):
adapter (default: null):
trainable (default: false):
use_pretrained (default: true):
reduce_output (default: sum): The method used to reduce a sequence of tensors down to a single tensor. Options: cls_pooled, last, sum, mean, max, concat, attention, null.
pretrained_kwargs (default: null):
max_sequence_length (default: null):
saved_weights_in_checkpoint (default: false):

DistilBERT¶

The distilbert encoder loads a pretrained DistilBERT (default distilbert-base-uncased) model using the Hugging Face transformers package. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT’s performance as measured on the GLUE language understanding benchmark.

encoder:
    type: distilbert
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: distilbert-base-uncased
    dropout: 0.1
    max_position_embeddings: 512
    attention_dropout: 0.1
    activation: gelu
    reduce_output: sum
    initializer_range: 0.02
    qa_dropout: 0.1
    seq_classif_dropout: 0.2
    skip: false
    adapter: null
    sinusoidal_pos_embds: false
    n_layers: 6
    n_heads: 12
    dim: 768
    hidden_dim: 3072
    pretrained_kwargs: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: distilbert-base-uncased): Name or path of the pretrained model.
dropout (default: 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
max_position_embeddings (default: 512): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
attention_dropout (default: 0.1): The dropout ratio for the attention probabilities.
activation (default: gelu): The non-linear activation function (function or string) in the encoder and pooler. Options: gelu, relu, silu, gelu_new.
reduce_output (default: sum): The method used to reduce a sequence of tensors down to a single tensor.
initializer_range (default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
qa_dropout (default: 0.1): The dropout probabilities used in the question answering model DistilBertForQuestionAnswering.
seq_classif_dropout (default: 0.2): The dropout probabilities used in the sequence classification and the multiple choice model DistilBertForSequenceClassification.
skip (default: false):
adapter (default: null):
sinusoidal_pos_embds (default: false): Whether to use sinusoidal positional embeddings.
n_layers (default: 6): Number of hidden layers in the Transformer encoder.
n_heads (default: 12): Number of attention heads for each attention layer in the Transformer encoder.
dim (default: 768): Dimensionality of the encoder layers and the pooler layer.
hidden_dim (default: 3072): The size of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.

ELECTRA¶

The electra encoder loads a pretrained ELECTRA model using the Hugging Face transformers package. ELECTRA is a new pretraining approach which trains two transformer models: the generator and the discriminator. The generator’s role is to replace tokens in a sequence, and is therefore trained as a masked language model. The discriminator, which is the model we’re interested in, tries to identify which tokens were replaced by the generator in the sequence.
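The discriminator's training target can be sketched as a per-token binary label: 1 where the generator changed the token, 0 where the token is unchanged. The example sentence and the helper replaced_token_labels are illustrative assumptions; real ELECTRA operates on subword ids rather than whole words.

```python
def replaced_token_labels(original, corrupted):
    """Label each position 1 if the token was replaced, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # generator replaced "cooked"
labels = replaced_token_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

Because every position receives a label, the discriminator learns from all tokens in the sequence rather than only the masked ones.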

encoder:
    type: electra
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: google/electra-small-discriminator
    hidden_dropout_prob: 0.1
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 512
    classifier_dropout: null
    reduce_output: sum
    embedding_size: 128
    hidden_size: 256
    hidden_act: gelu
    initializer_range: 0.02
    skip: false
    adapter: null
    num_hidden_layers: 12
    num_attention_heads: 4
    intermediate_size: 1024
    type_vocab_size: 2
    layer_norm_eps: 1.0e-12
    position_embedding_type: absolute
    pretrained_kwargs: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: google/electra-small-discriminator): Name or path of the pretrained model.
hidden_dropout_prob (default: 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (default: 0.1): The dropout ratio for the attention probabilities.
max_position_embeddings (default: 512): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
classifier_dropout (default: null): The dropout ratio for the classification head.
reduce_output (default: sum): The method used to reduce a sequence of tensors down to a single tensor.
embedding_size (default: 128): Dimensionality of the token embeddings.
hidden_size (default: 256): Dimensionality of the encoder layers and the pooler layer.
hidden_act (default: gelu): The non-linear activation function (function or string) in the encoder and pooler. Options: gelu, relu, silu, gelu_new.
initializer_range (default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
skip (default: false):
adapter (default: null):
num_hidden_layers (default: 12): Number of hidden layers in the Transformer encoder.
num_attention_heads (default: 4): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (default: 1024): Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
type_vocab_size (default: 2): The vocabulary size of the token_type_ids passed when calling ElectraModel or TFElectraModel.
layer_norm_eps (default: 1e-12): The epsilon used by the layer normalization layers.
position_embedding_type (default: absolute): Type of position embedding. Options: absolute, relative_key, relative_key_query.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.

GPT¶

The gpt encoder loads a pretrained GPT (default openai-gpt) model using the Hugging Face transformers package. GPT is a causal (unidirectional) transformer pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.

encoder:
    type: gpt
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: openai-gpt
    reduce_output: sum
    initializer_range: 0.02
    skip: false
    adapter: null
    n_positions: 40478
    n_ctx: 512
    n_embd: 768
    n_layer: 12
    n_head: 12
    afn: gelu
    resid_pdrop: 0.1
    embd_pdrop: 0.1
    attn_pdrop: 0.1
    layer_norm_epsilon: 1.0e-05
    pretrained_kwargs: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: openai-gpt): Name or path of the pretrained model.
reduce_output (default: sum): The method used to reduce a sequence of tensors down to a single tensor.
initializer_range (default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
skip (default: false):
adapter (default: null):
n_positions (default: 40478): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
n_ctx (default: 512): Dimensionality of the causal mask (usually same as n_positions).
n_embd (default: 768): Dimensionality of the embeddings and hidden states.
n_layer (default: 12): Number of hidden layers in the Transformer encoder.
n_head (default: 12): Number of attention heads for each attention layer in the Transformer encoder.
afn (default: gelu): The non-linear activation function (function or string) in the encoder and pooler. Options: gelu, relu, silu.
resid_pdrop (default: 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (default: 0.1): The dropout ratio for the embeddings.
attn_pdrop (default: 0.1): The dropout ratio for the attention.
layer_norm_epsilon (default: 1e-05): The epsilon to use in the layer normalization layers.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.
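The causal (unidirectional) behavior comes from a lower-triangular mask of size n_ctx by n_ctx: position i may attend to positions 0 through i only. The sketch below builds such a mask as nested lists for a tiny context; causal_mask is a hypothetical helper, and real implementations store the mask as a tensor.

```python
def causal_mask(n_ctx):
    """Build an n_ctx x n_ctx mask where 1 means 'may attend'."""
    return [[1 if j <= i else 0 for j in range(n_ctx)] for i in range(n_ctx)]


mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```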

GPT2¶

The gpt2 encoder loads a pretrained GPT-2 (default gpt2) model using the Hugging Face transformers package. GPT-2 is a causal (unidirectional) transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.

encoder:
    type: gpt2
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: gpt2
    reduce_output: sum
    initializer_range: 0.02
    skip: false
    adapter: null
    n_positions: 1024
    n_ctx: 1024
    n_embd: 768
    n_layer: 12
    n_head: 12
    n_inner: null
    activation_function: gelu_new
    resid_pdrop: 0.1
    embd_pdrop: 0.1
    attn_pdrop: 0.1
    layer_norm_epsilon: 1.0e-05
    scale_attn_weights: true
    pretrained_kwargs: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: gpt2): Name or path of the pretrained model.
reduce_output (default: sum): The method used to reduce a sequence of tensors down to a single tensor.
initializer_range (default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
skip (default: false):
adapter (default: null):
n_positions (default: 1024): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
n_ctx (default: 1024): Dimensionality of the causal mask (usually the same as n_positions).
n_embd (default: 768): Dimensionality of the embeddings and hidden states.
n_layer (default: 12): Number of hidden layers in the Transformer encoder.
n_head (default: 12): Number of attention heads for each attention layer in the Transformer encoder.
n_inner (default: null): Dimensionality of the inner feed-forward layers. If null, it is set to 4 times n_embd.
activation_function (default: gelu_new): The activation function. Options: relu, silu, gelu, tanh, gelu_new.
resid_pdrop (default: 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (default: 0.1): The dropout ratio for the embeddings.
attn_pdrop (default: 0.1): The dropout ratio for the attention.
layer_norm_epsilon (default: 1e-05): The epsilon to use in the layer normalization layers.
scale_attn_weights (default: true): Scale attention weights by dividing by sqrt(hidden_size).
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.
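
As a minimal sketch of how the n_inner fallback and the per-head width follow from the parameters above: the helper names below are hypothetical and are not part of Ludwig or transformers, and the divisibility check assumes standard multi-head attention, where the embedding is split evenly across heads.

```python
from typing import Optional


def resolve_n_inner(n_embd: int, n_inner: Optional[int] = None) -> int:
    """Feed-forward width: a null n_inner falls back to 4 * n_embd."""
    return 4 * n_embd if n_inner is None else n_inner


def head_size(n_embd: int, n_head: int) -> int:
    """Width of one attention head; n_embd must split evenly across heads."""
    if n_embd % n_head != 0:
        raise ValueError(f"n_embd={n_embd} is not divisible by n_head={n_head}")
    return n_embd // n_head


print(resolve_n_inner(768))  # 3072 with the default n_embd
print(head_size(768, 12))    # 64 with the default n_embd and n_head
```

If n_embd or n_head is changed while training from scratch, a check like this can catch an invalid combination before the model is built.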

Longformer¶

The longformer encoder loads a pretrained Longformer (default allenai/longformer-base-4096) model using the Hugging Face transformers package. Longformer is a good choice for longer text, as it supports sequences up to 4096 tokens long.

encoder:
    type: longformer
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: allenai/longformer-base-4096
    max_position_embeddings: 4098
    reduce_output: cls_pooled
    skip: false
    adapter: null
    attention_window: 512
    sep_token_id: 2
    type_vocab_size: 1
    pretrained_kwargs: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: allenai/longformer-base-4096): Name or path of the pretrained model.
max_position_embeddings (default: 4098): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
reduce_output (default: cls_pooled): The method used to reduce a sequence of tensors down to a single tensor.
skip (default: false):
adapter (default: null):
attention_window (default: 512): Size of an attention window around each token. If an int, use the same size for all layers. To specify a different window size for each layer, use a List[int] where len(attention_window) == num_hidden_layers.
sep_token_id (default: 2): ID of the separator token, which is used when building a sequence from multiple sequences.
type_vocab_size (default: 1): The vocabulary size of the token_type_ids passed when calling LongformerEncoder.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.
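
The two accepted forms of attention_window can be illustrated with a short sketch. The function name is hypothetical and this is not Ludwig's or transformers' own validation code; it only mirrors the rule stated above (one int for all layers, or a list with one entry per hidden layer).

```python
from typing import List, Union


def expand_attention_window(
    attention_window: Union[int, List[int]], num_hidden_layers: int
) -> List[int]:
    """Return one window size per layer, validating the list form."""
    if isinstance(attention_window, int):
        # A single int is reused for every layer.
        return [attention_window] * num_hidden_layers
    windows = list(attention_window)
    if len(windows) != num_hidden_layers:
        raise ValueError(
            f"expected {num_hidden_layers} window sizes, got {len(windows)}"
        )
    return windows


print(expand_attention_window(512, 3))              # [512, 512, 512]
print(expand_attention_window([128, 256, 512], 3))  # [128, 256, 512]
```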

ModernBERT¶

The modernbert encoder loads a pretrained ModernBERT (default answerdotai/ModernBERT-base) model using the Hugging Face transformers package. ModernBERT is the first major architectural upgrade to BERT, incorporating Flash Attention 2, Rotary Position Embeddings (RoPE), GeGLU activations, unpadding for efficiency, and alternating local/global attention. It supports up to 8192 tokens (vs BERT's 512).

encoder:
    type: modernbert
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: answerdotai/ModernBERT-base
    hidden_dropout_prob: 0.0
    max_position_embeddings: 8192
    reduce_output: cls_pooled
    hidden_size: 768
    num_hidden_layers: 22
    num_attention_heads: 12
    intermediate_size: 1152
    hidden_act: gelu
    initializer_range: 0.02
    layer_norm_eps: 1.0e-05
    pad_token_id: 50283
    pretrained_kwargs: null
    skip: false
    adapter: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: answerdotai/ModernBERT-base): Name or path of the pretrained model.
hidden_dropout_prob (default: 0.0): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
max_position_embeddings (default: 8192): The maximum sequence length that this model might ever be used with.
reduce_output (default: cls_pooled): The method used to reduce a sequence of tensors down to a single tensor.
hidden_size (default: 768): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (default: 22): Number of hidden layers in the Transformer encoder.
num_attention_heads (default: 12): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (default: 1152): Dimensionality of the 'intermediate' (often named feed-forward) layer in the Transformer encoder.
hidden_act (default: gelu): The non-linear activation function (function or string) in the encoder and pooler. Options: gelu, relu, silu, gelu_new.
initializer_range (default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (default: 1e-05): The epsilon used by the layer normalization layers.
pad_token_id (default: 50283): The ID of the token to use as padding.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.
skip (default: false):
adapter (default: null):

RoBERTa¶

The roberta encoder loads a pretrained RoBERTa (default roberta-base) model using the Hugging Face transformers package. RoBERTa is a replication of BERT pretraining that may match or exceed the performance of BERT. It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

encoder:
    type: roberta
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: roberta-base
    reduce_output: cls_pooled
    eos_token_id: 2
    skip: false
    adapter: null
    pad_token_id: 1
    bos_token_id: 0
    pretrained_kwargs: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: roberta-base): Name or path of the pretrained model.
reduce_output (default: cls_pooled): The method used to reduce a sequence of tensors down to a single tensor.
eos_token_id (default: 2): The end of sequence token ID.
skip (default: false):
adapter (default: null):
pad_token_id (default: 1): The ID of the token to use as padding.
bos_token_id (default: 0): The beginning of sequence token ID.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.

T5¶

The t5 encoder loads a pretrained T5 (default t5-small) model using the Hugging Face transformers package. T5 (Text-to-Text Transfer Transformer) is pre-trained on a huge text dataset crawled from the web and shows good transfer performance on multiple tasks.

encoder:
    type: t5
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: t5-small
    num_layers: 6
    dropout_rate: 0.1
    reduce_output: sum
    d_ff: 2048
    skip: false
    adapter: null
    d_model: 512
    d_kv: 64
    num_decoder_layers: 6
    num_heads: 8
    relative_attention_num_buckets: 32
    layer_norm_eps: 1.0e-06
    initializer_factor: 1
    feed_forward_proj: relu
    pretrained_kwargs: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: t5-small): Name or path of the pretrained model.
num_layers (default: 6): Number of hidden layers in the Transformer encoder.
dropout_rate (default: 0.1): The ratio for all dropout layers.
reduce_output (default: sum): The method used to reduce a sequence of tensors down to a single tensor.
d_ff (default: 2048): Size of the intermediate feed forward layer in each T5Block.
skip (default: false):
adapter (default: null):
d_model (default: 512): Size of the encoder layers and the pooler layer.
d_kv (default: 64): Size of the key, query, value projections per attention head. d_kv has to be equal to d_model // num_heads.
num_decoder_layers (default: 6): Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not set.
num_heads (default: 8): Number of attention heads for each attention layer in the Transformer encoder.
relative_attention_num_buckets (default: 32): The number of buckets to use for each attention layer.
layer_norm_eps (default: 1e-06): The epsilon used by the layer normalization layers.
initializer_factor (default: 1): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).
feed_forward_proj (default: relu): Type of feed forward layer to be used. The original T5 uses 'relu'; T5v1.1 uses the 'gated-gelu' feed forward projection. Options: relu, gated-gelu.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.
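
The relationship between d_model, num_heads, and d_kv stated above can be checked before training. This is a minimal sketch with a hypothetical helper name; it encodes only the constraint as written in the d_kv description, not any check performed by Ludwig or transformers.

```python
def check_d_kv(d_model: int, num_heads: int, d_kv: int) -> None:
    """Raise if d_kv does not equal d_model // num_heads."""
    expected = d_model // num_heads
    if d_kv != expected:
        raise ValueError(
            f"d_kv={d_kv}, but d_model // num_heads = {d_model} // {num_heads} = {expected}"
        )


# The defaults listed above are consistent: 512 // 8 == 64.
check_d_kv(d_model=512, num_heads=8, d_kv=64)
print("default T5 dimensions are consistent")
```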

XLMRoBERTa¶

The xlmroberta encoder loads a pretrained XLM-RoBERTa (default xlm-roberta-base) model using the Hugging Face transformers package. XLM-RoBERTa is a large multilingual language model similar to BERT, trained on 100 languages using 2.5TB of filtered CommonCrawl data. It is based on Facebook's RoBERTa model released in 2019.

encoder:
    type: xlmroberta
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: xlm-roberta-base
    reduce_output: cls_pooled
    max_position_embeddings: 514
    type_vocab_size: 1
    skip: false
    adapter: null
    pad_token_id: 1
    bos_token_id: 0
    eos_token_id: 2
    add_pooling_layer: true
    pretrained_kwargs: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: xlm-roberta-base): Name or path of the pretrained model.
reduce_output (default: cls_pooled): The method used to reduce a sequence of tensors down to a single tensor.
max_position_embeddings (default: 514): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (default: 1): The vocabulary size of the token_type_ids passed in.
skip (default: false):
adapter (default: null):
pad_token_id (default: 1): The ID of the token to use as padding.
bos_token_id (default: 0): The beginning of sequence token ID.
eos_token_id (default: 2): The end of sequence token ID.
add_pooling_layer (default: true): Whether to add a pooling layer to the encoder.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.

XLNet¶

The xlnet encoder loads a pretrained XLNet (default xlnet-base-cased) model using the Hugging Face transformers package. XLNet is an extension of the Transformer-XL model pre-trained using an autoregressive method to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization order. XLNet outperforms BERT on a variety of benchmarks.

encoder:
    type: xlnet
    trainable: false
    use_pretrained: true
    pretrained_model_name_or_path: xlnet-base-cased
    dropout: 0.1
    reduce_output: sum
    ff_activation: gelu
    initializer_range: 0.02
    summary_activation: tanh
    summary_last_dropout: 0.1
    skip: false
    adapter: null
    d_model: 768
    n_layer: 12
    n_head: 12
    d_inner: 3072
    untie_r: true
    attn_type: bi
    layer_norm_eps: 1.0e-12
    mem_len: null
    reuse_len: null
    use_mems_eval: true
    use_mems_train: false
    bi_data: false
    clamp_len: -1
    same_length: false
    summary_type: last
    summary_use_proj: true
    start_n_top: 5
    end_n_top: 5
    pad_token_id: 5
    bos_token_id: 1
    eos_token_id: 2
    pretrained_kwargs: null

Parameters:

trainable (default: false) : Whether to finetune the model on your dataset.
use_pretrained (default: true) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive.
pretrained_model_name_or_path (default: xlnet-base-cased): Name or path of the pretrained model.
dropout (default: 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
reduce_output (default: sum): The method used to reduce a sequence of tensors down to a single tensor.
ff_activation (default: gelu): The non-linear activation function (function or string) in the encoder and pooler. Options: gelu, relu, silu, gelu_new.
initializer_range (default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
summary_activation (default: tanh): Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
summary_last_dropout (default: 0.1): Used in the sequence classification and multiple choice models.
skip (default: false):
adapter (default: null):
d_model (default: 768): Dimensionality of the encoder layers and the pooler layer.
n_layer (default: 12): Number of hidden layers in the Transformer encoder.
n_head (default: 12): Number of attention heads for each attention layer in the Transformer encoder.
d_inner (default: 3072): Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
untie_r (default: true): Whether or not to untie relative position biases.
attn_type (default: bi): The attention type used by the model. Currently only 'bi' is supported. Options: bi.
layer_norm_eps (default: 1e-12): The epsilon used by the layer normalization layers.
mem_len (default: null): The number of tokens to cache. The key/value pairs that have already been pre-computed in a previous forward pass won’t be re-computed.
reuse_len (default: null): The number of tokens in the current batch to be cached and reused in the future.
use_mems_eval (default: true): Whether or not the model should make use of the recurrent memory mechanism in evaluation mode.
use_mems_train (default: false): Whether or not the model should make use of the recurrent memory mechanism in train mode.
bi_data (default: false): Whether or not to use bidirectional input pipeline. Usually set to True during pretraining and False during finetuning.
clamp_len (default: -1): Clamp all relative distances larger than clamp_len. Setting this attribute to -1 means no clamping.
same_length (default: false): Whether or not to use the same attention length for each token.
summary_type (default: last): Argument used when doing sequence summary. Used in the sequence classification and multiple choice models. Options: last, first, mean, cls_index, attn.
summary_use_proj (default: true):
start_n_top (default: 5): Used in the SQuAD evaluation script.
end_n_top (default: 5): Used in the SQuAD evaluation script.
pad_token_id (default: 5): The ID of the token to use as padding.
bos_token_id (default: 1): The beginning of sequence token ID.
eos_token_id (default: 2): The end of sequence token ID.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.

LLM Encoders¶

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["Pretrained\n LLM"];
  B --> C["Last\n Hidden\n State"];
  C --> ...;

The LLM encoder processes text with a pretrained LLM (e.g., llama-2-7b) and passes the last hidden state of the LLM forward to the combiner. Like the LLM model type, adapter-based fine-tuning and quantization can be configured, and any combiner or decoder parameters will be bundled with the adapter weights.

Example config:

encoder:
  type: llm
  base_model: meta-llama/Llama-2-7b-hf
  adapter:
    type: lora
  quantization:
    bits: 4

Parameters:

Base Model¶

The base_model parameter specifies the pretrained large language model to serve as the foundation of your custom LLM.

More information about the base_model parameter can be found here

Adapter¶

LoRA¶

LoRA is a simple yet effective method for parameter-efficient fine-tuning of pretrained language models. It works by adding a small number of trainable parameters to the model, which are used to adapt the pretrained parameters to the downstream task. Because only these added parameters are trained while the pretrained weights stay frozen, fine-tuning requires far less memory and compute than updating the full model.

adapter:
    type: lora
    r: 8
    dropout: 0.05
    target_modules: null
    use_rslora: false
    use_dora: false
    alpha: 16
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    bias_type: none
    loraplus_lr_ratio: null
    init_lora_weights: default
    eva_config: null
    loftq_config: null
    rank_pattern: null
    alpha_pattern: null
    layer_replication: null

r (default: 8) : LoRA attention dimension (the rank of the update matrices).
dropout (default: 0.05): The dropout probability for LoRA layers.
target_modules (default: null): List of module names or regex expression of the module names to replace with LoRA. For example, ['q', 'v'] or '.decoder.(SelfAttention|EncDecAttention).*(q|v)$'. Defaults to targeting the query and value matrices of all self-attention and encoder-decoder attention layers.
use_rslora (default: false): When set to True, uses Rank-Stabilized LoRA which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it was proven to work better. Otherwise, it will use the original default value of lora_alpha/r. Paper: https://arxiv.org/abs/2312.03732.
use_dora (default: false): Enable 'Weight-Decomposed Low-Rank Adaptation' (DoRA). This technique decomposes the updates of the weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. Right now, DoRA only supports non-quantized linear layers. DoRA introduces a bigger overhead than pure LoRA, so it is recommended to merge weights for inference. For more information, see https://arxiv.org/abs/2402.09353
alpha (default: null): The alpha parameter for Lora scaling. Defaults to 2 * r.
pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
bias_type (default: none): Bias type for Lora. Options: none, all, lora_only.
loraplus_lr_ratio (default: null): LoRA+ learning rate ratio (Hayou et al., ICML 2024). When set, the B matrices use lr * loraplus_lr_ratio while A matrices use the base lr. Typical values: 2-16. Provides 1-2% accuracy gain and up to 2x speedup over standard LoRA. Paper: https://arxiv.org/abs/2402.12354
init_lora_weights (default: default): Initialization strategy for LoRA weight matrices. 'default' uses the standard Kaiming uniform init (A) and zeros (B). 'gaussian' uses Gaussian init for A. 'pissa' (Principal Singular values and Singular vectors Adaptation) initializes using SVD of the pretrained weight, converging faster and often outperforming standard LoRA. Paper: https://arxiv.org/abs/2404.02948. 'eva' (Explained Variance Adaptation) initializes from the SVD of layer input activations — requires eva_config to be set. Paper: https://arxiv.org/abs/2410.07170. 'corda' (Context-Oriented Decomposition Adaptation) combines PiSSA and full fine-tuning signals, converging faster than PiSSA. Paper: https://arxiv.org/abs/2406.05223. 'olora' (Orthonormal LoRA) uses QR decomposition for better conditioning. 'loftq' (LoftQ) jointly quantizes base weights and initializes LoRA — requires loftq_config to be set. Paper: https://arxiv.org/abs/2310.08659. 'orthogonal' uses orthogonal initialization. Options: default, gaussian, eva, olora, pissa, corda, loftq, orthogonal.
eva_config (default: null):
loftq_config (default: null):
rank_pattern (default: null): Per-layer rank overrides as a mapping of layer name (or regex) to rank integer. Overrides the global r for matched layers. Useful for LoRA-XS style configurations where different layers benefit from different ranks. Example: {'model.layers.0.self_attn.q_proj': 4, 'model.layers.0.self_attn.v_proj': 2}
alpha_pattern (default: null): Per-layer alpha (scaling) overrides as a mapping of layer name (or regex) to float. Overrides the global alpha for matched layers.
layer_replication (default: null): Layer replication configuration as a list of [start, end] index pairs. Enables depth-wise parameter efficiency by sharing LoRA weights across layer ranges. Example: [[0, 4], [2, 5]] creates two overlapping groups.
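
The interaction between r, alpha, and use_rslora described above can be made concrete with a small sketch. The function names are hypothetical and this is not the peft implementation; it only reproduces the stated formulas (alpha defaulting to 2 * r, scaling of alpha / r, or alpha / sqrt(r) with Rank-Stabilized LoRA) and the parameter count of a rank-r update, assuming one d_out x r matrix B and one r x d_in matrix A per adapted layer.

```python
import math
from typing import Optional


def lora_scaling(r: int, alpha: Optional[float] = None, use_rslora: bool = False) -> float:
    """Scaling factor applied to the low-rank update."""
    if alpha is None:
        alpha = 2 * r  # documented fallback when alpha is null
    return alpha / math.sqrt(r) if use_rslora else alpha / r


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable weights added to one d_out x d_in linear layer."""
    return r * (d_in + d_out)


print(lora_scaling(r=8))                             # 2.0
print(lora_scaling(r=8, alpha=16, use_rslora=True))  # 16 / sqrt(8)
print(lora_trainable_params(768, 768, r=8))          # 12288, versus 589824 in the full matrix
```

With the standard scaling, doubling r while keeping alpha fixed halves the factor; the rank-stabilized variant shrinks it more slowly, by a factor of sqrt(2).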

AdaLoRA¶

AdaLoRA is an extension of LoRA that adapts the rank of each update to the task. Rather than giving every weight matrix the same rank r, it starts each incremental matrix at rank init_r and, during training, reallocates a rank budget across matrices until the average rank reaches target_r, so that more parameters go to the matrices that matter most. As with LoRA, only the added parameters are trained and the pretrained weights stay frozen.

adapter:
    type: adalora
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    r: 8
    alpha: 16
    dropout: 0.05
    bias_type: none
    target_modules: null
    use_rslora: false
    use_dora: false
    loraplus_lr_ratio: null
    init_lora_weights: default
    eva_config: null
    loftq_config: null
    rank_pattern: null
    alpha_pattern: null
    layer_replication: null
    target_r: 8
    init_r: 12
    tinit: 0
    tfinal: 0
    delta_t: 1
    beta1: 0.85
    beta2: 0.85
    orth_reg_weight: 0.5
    total_step: 10000

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
r (default: 8):
alpha (default: null):
dropout (default: 0.05):
bias_type (default: none):
target_modules (default: null):
use_rslora (default: false):
use_dora (default: false):
loraplus_lr_ratio (default: null):
init_lora_weights (default: default):
eva_config (default: null):
loftq_config (default: null):
rank_pattern (default: null): The allocated rank for each weight matrix by RankAllocator.
alpha_pattern (default: null):
layer_replication (default: null):
target_r (default: 8): Target LoRA matrix dimension. The target average rank of the incremental matrices.
init_r (default: 12): Initial LoRA matrix dimension. The initial rank for each incremental matrix.
tinit (default: 0): The steps of initial fine-tuning warmup.
tfinal (default: 0): The steps of final fine-tuning warmup.
delta_t (default: 1): The time interval between two budget allocations. The step interval of rank allocation.
beta1 (default: 0.85): The hyperparameter of EMA for sensitivity smoothing.
beta2 (default: 0.85): The hyperparameter of EMA for uncertainty quantification.
orth_reg_weight (default: 0.5): The coefficient of orthogonality regularization.
total_step (default: 10000): The total training steps for AdaLoRA rank allocation scheduling. Must be a positive integer (required by peft >= 0.14).
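
To give a feel for how init_r, target_r, tinit, tfinal, and total_step interact, the sketch below traces the average rank over training. It assumes the cubic decay schedule described in the AdaLoRA paper; the function name is hypothetical, delta_t is not modeled, and the scheduling in the installed peft version may differ in detail.

```python
def average_rank_at(
    step: int,
    init_r: int = 12,
    target_r: int = 8,
    tinit: int = 0,
    tfinal: int = 0,
    total_step: int = 10000,
) -> float:
    """Approximate average rank at a training step (assumed cubic schedule)."""
    if step <= tinit:
        return float(init_r)  # initial warmup: keep the full starting rank
    if step > total_step - tfinal:
        return float(target_r)  # final phase: the budget is fixed at the target
    # Fraction of the pruning phase that has elapsed, in (0, 1].
    progress = (step - tinit) / (total_step - tfinal - tinit)
    return target_r + (init_r - target_r) * (1 - progress) ** 3


for step in (0, 2500, 5000, 10000):
    print(step, average_rank_at(step))
```

Under this assumption most of the rank reduction happens early in training, which leaves the later steps to fine-tune with a nearly final allocation.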

IA3¶

Infused Adapter by Inhibiting and Amplifying Inner Activations, or IA3, is a method that adds three learned vectors, l_k, l_v, and l_ff, to rescale the keys and values of the self-attention and encoder-decoder attention layers, and the intermediate activation of the position-wise feed-forward network, respectively. These learned vectors are the only trainable parameters during fine-tuning, and thus the original weights remain frozen. Dealing with learned vectors (as opposed to learned low-rank updates to a weight matrix like LoRA) keeps the number of trainable parameters much smaller.

adapter:
    type: ia3
    target_modules: null
    feedforward_modules: null
    fan_in_fan_out: false
    modules_to_save: null
    init_ia3_weights: true
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false

target_modules (default: null) : The names of the modules to apply (IA)^3 to.
feedforward_modules (default: null) : The names of the modules to be treated as feedforward modules, as in the original paper. These modules will have (IA)^3 vectors multiplied to the input, instead of the output. feedforward_modules must be a name or a subset of names present in target_modules.
fan_in_fan_out (default: false) : Set this to True if the layer to replace stores weight like (fan_in, fan_out). For example, gpt-2 uses Conv1D which stores weights like (fan_in, fan_out) and hence this should be set to True.
modules_to_save (default: null) : List of modules apart from (IA)^3 layers to be set as trainable and saved in the final checkpoint.
init_ia3_weights (default: true) : Whether to initialize the vectors in the (IA)^3 layers, defaults to True.
pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
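
The difference between regular target modules (vector applied to the output) and feedforward_modules (vector applied to the input) can be sketched with plain Python lists. The helper names are hypothetical and this is not the peft implementation; the learned vectors are assumed to start at all ones, so that an untrained adapter leaves the layer unchanged.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weight: Matrix, x: Vector) -> Vector:
    """Frozen linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def rescale(values: Vector, learned: Vector) -> Vector:
    """Element-wise product with a learned (IA)^3 vector."""
    if len(values) != len(learned):
        raise ValueError("vector lengths must match")
    return [v * l for v, l in zip(values, learned)]


def ia3_output_scaled(weight: Matrix, x: Vector, learned: Vector) -> Vector:
    """Key/value-style module: rescale the layer output."""
    return rescale(linear(weight, x), learned)


def ia3_input_scaled(weight: Matrix, x: Vector, learned: Vector) -> Vector:
    """Feedforward-style module: rescale the layer input."""
    return linear(weight, rescale(x, learned))


weight = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
print(ia3_output_scaled(weight, x, [1.0, 1.0]))  # [3.0, 7.0], identical to the frozen layer
print(ia3_output_scaled(weight, x, [2.0, 0.5]))  # [6.0, 3.5]
print(ia3_input_scaled(weight, x, [2.0, 0.5]))   # [3.0, 8.0]
```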

VeRA¶

Vector-based Random Matrix Adaptation. Shares frozen random matrices across layers; only small scaling vectors are trained, giving 10× fewer parameters than LoRA at the same rank.

adapter:
    type: vera
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    r: 256
    target_modules: null
    projection_prng_key: 0

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
r (default: 256): VeRA rank dimension.
target_modules (default: null): List of module names to apply VeRA to.
projection_prng_key (default: 0): PRNG key for shared random projection matrices.
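
A rough parameter count shows why training only scaling vectors is cheap even at a high rank. The sketch assumes, following the VeRA paper, that each adapted layer trains one vector of length r and one of length d_out, while LoRA trains an r x d_in and a d_out x r matrix; the function names are hypothetical and the figures are per layer, not measurements.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA trains both low-rank matrices of one layer."""
    return r * (d_in + d_out)


def vera_trainable_params(d_out: int, r: int) -> int:
    """VeRA trains only two scaling vectors; the random matrices stay frozen."""
    return r + d_out


# One 768 x 768 projection, using the default ranks listed in this document.
print(lora_trainable_params(768, 768, r=8))  # 12288
print(vera_trainable_params(768, r=256))     # 1024
```

Because the frozen matrices are generated from projection_prng_key, they can be regenerated from the seed instead of being stored with the adapter.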

LoHa¶

Low-Rank Hadamard Product Adaptation. Uses a Hadamard product of two low-rank matrices to capture more complex weight updates than LoRA at the same rank.

adapter:
    type: loha
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    r: 8
    alpha: 8
    target_modules: null

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
r (default: 8): LoHa rank dimension.
alpha (default: 8): Scaling factor for LoHa.
target_modules (default: null): List of module names to apply LoHa to.
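
The Hadamard construction can be illustrated on tiny matrices. This is a pure-Python sketch with hypothetical helper names, not the peft implementation, and it omits the alpha scaling: two rank-r products are formed separately and then multiplied element-wise to give the weight update.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def hadamard(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise product of two matrices with the same shape."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def loha_delta(b1: Matrix, a1: Matrix, b2: Matrix, a2: Matrix) -> Matrix:
    """Weight update (B1 A1) * (B2 A2), with * taken element-wise."""
    return hadamard(matmul(b1, a1), matmul(b2, a2))


# Two rank-1 factor pairs for a 2 x 2 weight.
delta = loha_delta([[1.0], [2.0]], [[1.0, 2.0]], [[1.0], [1.0]], [[3.0, 4.0]])
print(delta)  # [[3.0, 8.0], [6.0, 16.0]]
```

The element-wise product of two rank-r matrices can have rank up to r * r, which is how the update can be more expressive than a single rank-r LoRA product.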

LoKr¶

Low-Rank Kronecker Product Adaptation. Uses Kronecker product decomposition for efficient weight updates with a different inductive bias than LoRA.

adapter:
    type: lokr
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    r: 8
    alpha: 8
    target_modules: null

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
r (default: 8): LoKr rank dimension.
alpha (default: 8): Scaling factor for LoKr.
target_modules (default: null): List of module names to apply LoKr to.
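
A small example shows how a Kronecker product builds a large update from small factors. The helper below is a hypothetical pure-Python sketch, not the peft implementation, and it ignores the rank and alpha settings: each entry of the first factor scales a full copy of the second.

```python
from typing import List

Matrix = List[List[float]]


def kron(a: Matrix, b: Matrix) -> Matrix:
    """Kronecker product: an (m*p) x (n*q) matrix from a (m x n) and b (p x q)."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


a = [[1, 2], [3, 4]]
b = [[0, 1], [1, 0]]
for row in kron(a, b):
    print(row)
# [0, 1, 0, 2]
# [1, 0, 2, 0]
# [0, 3, 0, 4]
# [3, 0, 4, 0]
```

Here two 2 x 2 factors (8 values in total) define a 4 x 4 update (16 values); the saving grows quickly with the size of the adapted layer.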

FourierFT¶

Frequency-domain fine-tuning. Learns weight updates in the Fourier frequency domain, providing a complementary inductive bias to spatial methods like LoRA.

adapter:
    type: fourierft
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    n_frequency: 1000
    scaling: 150.0
    target_modules: null

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
n_frequency (default: 1000): Number of frequency components.
scaling (default: 150.0): Scaling factor for FourierFT.
target_modules (default: null): List of module names to apply FourierFT to.

BOFT¶

Butterfly Orthogonal Fine-Tuning. Learns orthogonal transformations via butterfly factorization, preserving the pre-trained model's geometry while adapting to new tasks.

adapter:
    type: boft
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    boft_block_size: 4
    boft_n_butterfly_factor: 1
    boft_dropout: 0.05
    target_modules: null

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
boft_block_size (default: 4): Block size for butterfly factorization.
boft_n_butterfly_factor (default: 1): Number of butterfly factors.
boft_dropout (default: 0.05): Dropout for BOFT layers.
target_modules (default: null): List of module names to apply BOFT to.

TinyLoRA¶

TinyLoRA: extremely parameter-efficient fine-tuning via SVD projection (LoRA-XS variant).

adapter:
    type: tinylora
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    r: 2
    u: 64
    weight_tying: 0.0
    projection_seed: 42
    save_projection: true
    tinylora_dropout: 0.0
    target_modules: null

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
r (default: 2): SVD rank for the frozen U, Sigma, V decomposition. The paper recommends r=2.
u (default: 64): Trainable vector dimension per group. Controls the expressivity of the adaptation. Can be as low as 1–13 for extreme parameter efficiency.
weight_tying (default: 0.0): Degree of weight tying across target modules (0.0 = no sharing, 1.0 = full sharing). Sharing trainable vectors across modules further reduces parameter count.
projection_seed (default: 42): Random seed for generating the fixed projection matrices.
save_projection (default: true): Whether to save the projection tensors in the state dict.
tinylora_dropout (default: 0.0): Dropout probability for TinyLoRA layers.
target_modules (default: null): List of module names or regex to apply TinyLoRA to.

C3A¶

C3A: Circular Convolution Adaptation. Learns weight updates as block-circulant matrices applied through circular convolution.

adapter:
    type: c3a
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    block_size: 256
    target_modules: null
    bias_type: none

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
block_size (default: 256): Block size for C3A. It must evenly divide both the input size and the output size of each target layer. Setting this to the GCD of all target layer dimensions is a safe default. Larger block sizes mean fewer parameters.
target_modules (default: null): List of module names or regex to apply C3A to.
bias_type (default: none): Bias type for C3A. 'none' trains no biases; 'all' or 'c3a_only' trains the adapter biases. Options: none, all, c3a_only.
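
The GCD default mentioned for block_size can be computed with the standard library. This is a minimal sketch: the helper name `safe_block_size` and the layer shapes are hypothetical, so substitute the (input size, output size) pairs of your own target modules.

```python
from functools import reduce
from math import gcd


def safe_block_size(layer_shapes):
    """Greatest common divisor of every input and output size."""
    dims = [dim for shape in layer_shapes for dim in shape]
    return reduce(gcd, dims)


# Hypothetical (input size, output size) pairs for three target layers.
shapes = [(4096, 4096), (4096, 11008), (11008, 4096)]
block = safe_block_size(shapes)
```

Every dimension in `shapes` is a multiple of `block`, so the same value can be used for all target layers.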

OFT¶

OFT: Orthogonal Fine-Tuning that preserves hyperspherical energy of the pre-trained model.

adapter:
    type: oft
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    r: 0
    oft_block_size: 32
    module_dropout: 0.0
    target_modules: null
    coft: false
    eps: 6.0e-05

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
r (default: 0): OFT rank. When 0, oft_block_size controls the granularity instead. Only one of r and oft_block_size may be nonzero.
oft_block_size (default: 32): Size of each block in the block-diagonal orthogonal transform.
module_dropout (default: 0.0): Probability of randomly zeroing an OFT block during training.
target_modules (default: null): List of module names or regex to apply OFT to.
coft (default: false): Whether to use Constrained OFT (COFT), which keeps the orthogonal matrix R close to the identity by enforcing the constraint ||R - I||_F <= eps.
eps (default: 6e-05): Constraint strength for COFT (only used when coft is true).

HRA¶

HRA: Householder Reflection Adaptation — orthogonal updates via Householder reflections.

adapter:
    type: hra
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    r: 8
    apply_GS: false
    target_modules: null

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
r (default: 8): Number of Householder reflections (rank). More reflections = more expressive adaptation.
apply_GS (default: false): Whether to apply Gram-Schmidt orthogonalization to the Householder vectors. Improves numerical stability at the cost of a small overhead.
target_modules (default: null): List of module names or regex to apply HRA to.
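
To make the idea of Householder reflections concrete, the sketch below builds each reflection as H = I - 2 u u^T / (u^T u) and chains r = 2 of them. It uses plain Python lists rather than tensors, and the helper names and vectors are illustrative assumptions, not Ludwig code.

```python
def householder(u):
    """Reflection matrix H = I - 2 * u u^T / (u^T u) for a nonzero vector u."""
    norm_sq = sum(x * x for x in u)
    n = len(u)
    return [
        [(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] / norm_sq for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


# r = 2 reflections, one per (hypothetical) learned vector.
vectors = [[1.0, 2.0, 2.0], [0.0, 1.0, -1.0]]
q = householder(vectors[0])
for u in vectors[1:]:
    q = matmul(q, householder(u))

# A product of reflections is orthogonal, so q^T q is the identity.
gram = matmul(transpose(q), q)
```

Because each factor is orthogonal, the chained matrix `q` stays orthogonal regardless of how many reflections are used.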

WaveFT¶

WaveFT: Wavelet-domain fine-tuning with structured frequency-domain weight updates.

adapter:
    type: waveft
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    n_frequency: 2592
    scaling: 25.0
    wavelet_family: db1
    target_modules: null

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
n_frequency (default: 2592): Number of wavelet frequency components to learn. Fewer = more parameter efficient.
scaling (default: 25.0): Scaling factor applied to the wavelet-domain updates.
wavelet_family (default: db1): Wavelet family to use for the discrete wavelet transform. 'db1'/'haar' are simplest; higher-order Daubechies ('db2', 'db3') capture smoother features. Options: db1, db2, db3, haar, sym2, coif1.
target_modules (default: null): List of module names or regex to apply WaveFT to.

LN-Tuning¶

LN-Tuning: tunes only the layer normalization parameters for ultra-lightweight adaptation.

adapter:
    type: ln_tuning
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    target_modules: null

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
target_modules (default: null): List of layer norm module names or regex to tune. Defaults to all LayerNorm / RMSNorm modules in the model.

VBLoRA¶

VBLoRA: Vector Bank LoRA that shares vectors across layers for extreme compression.

adapter:
    type: vblora
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    r: 4
    num_vectors: 256
    vector_length: 256
    topk: 2
    vblora_dropout: 0.0
    save_only_topk_weights: false
    target_modules: null

pretrained_adapter_weights (default: null):
postprocessor (default: null):
postprocessor.merge_adapter_into_base_model (default: false): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately).
postprocessor.progressbar (default: false): Instructs whether or not to show a progress bar indicating the unload and merge process.
r (default: 4): LoRA rank dimension. Controls the bottleneck size of each adaptation.
num_vectors (default: 256): Number of vectors in the global vector bank shared across all layers.
vector_length (default: 256): Length (dimension) of each vector in the bank. Usually set to the hidden size or head dim.
topk (default: 2): Number of top-k vectors selected from the bank for each LoRA matrix column. Higher k increases expressivity but also parameter count.
vblora_dropout (default: 0.0): Dropout probability for VBLoRA layers.
save_only_topk_weights (default: false): Whether to save only the top-k selection logits rather than the full bank weights.
target_modules (default: null): List of module names or regex to apply VBLoRA to.
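
The interaction between num_vectors, vector_length and topk can be sketched as follows: each sub-vector of a LoRA matrix is formed as a softmax-weighted mix of the top-k bank vectors picked by a set of logits. This follows the general vector bank idea under simplifying assumptions; the function name `compose_sub_vector` and the toy bank are hypothetical and do not come from Ludwig.

```python
import math


def compose_sub_vector(logits, bank, topk):
    """Mix the topk bank vectors with the highest logits using softmax weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:topk]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    length = len(bank[0])
    return [sum(w * bank[i][d] for w, i in zip(weights, top)) for d in range(length)]


# Toy bank with num_vectors = 4 and vector_length = 2.
bank = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 1.0]]
logits = [0.0, 3.0, 3.0, -2.0]
sub_vector = compose_sub_vector(logits, bank, topk=2)
```

Here the two highest logits are tied, so the result is the plain average of bank vectors 1 and 2.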

More information about the adapter config can be found here.

Quantization¶

Attention

Quantized fine-tuning currently requires using adapter: lora. In-context learning does not have this restriction.

Attention

Quantization is currently only supported with backend: local.

quantization:
    bits: 4
    backend: bitsandbytes
    mode: null
    qat: false
    llm_int8_threshold: 6.0
    llm_int8_has_fp16_weight: false
    bnb_4bit_compute_dtype: float16
    bnb_4bit_use_double_quant: true
    bnb_4bit_quant_type: nf4

bits (default: 4): The quantization level to apply to weights on load. Options: 4, 8.
backend (default: bitsandbytes): Quantization backend. 'bitsandbytes' (default) applies 4-bit / 8-bit quantization at model load time via the bitsandbytes library — the existing QLoRA fine-tuning path. 'torchao' applies PyTorch-native quantization via torchao after model load, and can additionally run quantization-aware training (QAT) when qat: true is set. Options: bitsandbytes, torchao.
mode (default: null): torchao-only quantization mode. Ignored when backend is 'bitsandbytes'. 'int4_weight_only' and 'int8_weight_only' quantize only the weight matrices (activations stay in fp16/bf16). 'int8_dynamic' quantizes activations to int8 dynamically per-forward. 'float8' stores weights in fp8 (useful on H100+). Options: int4_weight_only, int8_weight_only, int8_dynamic, float8, null.
qat (default: false): torchao-only. When true, inserts fake-quant observers into the model before training (QAT). The model is trained in the target low-precision numerical regime, then converted to actually-quantized weights at save time. Ignored when backend is 'bitsandbytes'.
llm_int8_threshold (default: 6.0): The outlier threshold for outlier detection, as described in the paper LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (https://arxiv.org/abs/2208.07339). Any hidden state value above this threshold is considered an outlier, and the operation on those values is done in fp16. Values are usually normally distributed, with most values in the range [-3.5, 3.5], but large models have some exceptional systematic outliers that are distributed very differently. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).
llm_int8_has_fp16_weight (default: false): This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not have to be converted back and forth for the backward pass.
bnb_4bit_compute_dtype (default: float16): This sets the computational type which might be different than the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups. Options: float32, float16, bfloat16.
bnb_4bit_use_double_quant (default: true): This flag is used for nested quantization where the quantization constants from the first quantization are quantized again.
bnb_4bit_quant_type (default: nf4): This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options: fp4, nf4.
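
As a back-of-the-envelope illustration of what the bits setting buys, the sketch below estimates the memory taken by the weights alone. The 7-billion-parameter model is a hypothetical example, and the estimate deliberately ignores quantization constants, activations and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_parameters, bits):
    """Approximate size of the weights in GiB at a given bit width."""
    return num_parameters * bits / 8 / 1024**3


# Hypothetical 7B-parameter model at 16-bit, 8-bit and 4-bit precision.
estimates = {bits: weight_memory_gib(7_000_000_000, bits) for bits in (16, 8, 4)}
for bits, gib in estimates.items():
    print(f"{bits}-bit weights: about {gib:.2f} GiB")
```

Halving the bit width halves this estimate, which is why 4-bit loading is the usual choice when GPU memory is the bottleneck.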

More information about quantization parameters can be found here.

Model Parameters¶

More information about the model initialization parameters can be found here.

Output Features¶

Text output features are a special case of Sequence Features, so all options of sequence features are available for text features as well.

Text output features can be used for either tagging (classifying each token of an input sequence) or text generation (generating text by repeatedly sampling from the model). There are two decoders available for these tasks named tagger and generator respectively.

Example text output feature using default parameters:

name: text_column_name
type: text
reduce_input: null
dependencies: []
reduce_dependencies: sum
loss:
    type: softmax_cross_entropy
    confidence_penalty: 0
    robust_lambda: 0
    class_weights: 1
    class_similarities_temperature: 0
decoder:
    type: generator

Parameters:

reduce_input (default sum): defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: sum, mean or avg, max, concat (concatenates along the sequence dimension), last (returns the last vector of the sequence dimension).
dependencies (default []): the output features this one is dependent on. For a detailed explanation refer to Output Feature Dependencies.
reduce_dependencies (default sum): defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: sum, mean or avg, max, concat (concatenates along the sequence dimension), last (returns the last vector of the sequence dimension).
loss (default {type: softmax_cross_entropy, class_similarities_temperature: 0, class_weights: 1, confidence_penalty: 0, robust_lambda: 0}): is a dictionary containing a loss type. The only available loss type for text features is softmax_cross_entropy. See Loss for details.
decoder (default: {"type": "generator"}): Decoder for the desired task. Options: generator, tagger. See Decoder for details.
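
The reduction modes listed for reduce_input and reduce_dependencies can be pictured on a single s x h matrix (batch dimension omitted). This is an illustrative sketch in plain Python, not the library's code; the function name `reduce_sequence` is an assumption.

```python
def reduce_sequence(matrix, mode):
    """Reduce an s x h matrix (a list of s vectors of size h) to one vector."""
    if mode == "sum":
        return [sum(column) for column in zip(*matrix)]
    if mode in ("mean", "avg"):
        return [sum(column) / len(matrix) for column in zip(*matrix)]
    if mode == "max":
        return [max(column) for column in zip(*matrix)]
    if mode == "concat":
        # Concatenate the s vectors into one vector of size s * h.
        return [value for row in matrix for value in row]
    if mode == "last":
        return list(matrix[-1])
    raise ValueError(f"unknown reduce mode: {mode}")


# Three sequence steps with a hidden size of 2.
steps = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
reduced = {mode: reduce_sequence(steps, mode) for mode in ("sum", "mean", "max", "concat", "last")}
```

All modes except concat return a vector of size h; concat returns one of size s * h.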

Decoder type and decoder parameters can also be defined once and applied to all text output features using the Type-Global Decoder section. Loss and loss related parameters can also be defined once in the same way.

Decoders¶

Generator¶

graph LR
  A["Combiner Output"] --> B["Fully\n Connected\n Layers"];
  B --> C1["RNN"] --> C2["RNN"] --> C3["RNN"];
  GO(["GO"]) -.-o C1;
  C1 -.-o O1("Output");
  O1 -.-o C2;
  C2 -.-o O2("Output");
  O2 -.-o C3;
  C3 -.-o END(["END"]);
  subgraph DEC["DECODER.."]
  B
  C1
  C2
  C3
  end

In the case of generator, the decoder is a (potentially empty) stack of fully connected layers followed by an RNN that generates outputs by feeding on its own previous predictions. It produces a tensor of size b x s' x c, where b is the batch size, s' is the length of the generated sequence and c is the number of classes, which is then passed to a softmax_cross_entropy.

During training, teacher forcing is adopted: the list of targets is provided as both inputs and outputs (shifted by 1). At evaluation time, decoding is performed by beam search with a beam width of 1 by default, which is equivalent to greedy decoding (generating one token at a time and feeding it as input for the next step).

In general, a generator expects a b x h shaped input tensor, where h is a hidden dimension. The h vectors are fed into the RNN generator after an optional stack of fully connected layers. One exception is when the generator uses attention: in that case the expected input tensor has size b x s x h, which is the output of a sequence, text or time series input feature without reduced outputs, or the output of a sequence-based combiner. If a b x h input is provided to a generator decoder that uses an RNN with attention, an error will be raised during model building.
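
The difference between training and evaluation described above can be sketched with a toy next-token table. Everything here is hypothetical (the `<GO>` and `<END>` symbol spellings, the `TOY_MODEL` scores and the function names); it only illustrates the shift-by-one pairing used in teacher forcing and the feedback loop of greedy decoding.

```python
GO, END = "<GO>", "<END>"


def teacher_forcing_pairs(target_tokens):
    """Pair each decoder input with the token it should predict (shifted by 1)."""
    decoder_inputs = [GO] + target_tokens
    decoder_targets = target_tokens + [END]
    return list(zip(decoder_inputs, decoder_targets))


def greedy_decode(next_token_scores, max_length):
    """Generate one token at a time, feeding each prediction back as input."""
    generated, previous = [], GO
    for _ in range(max_length):
        scores = next_token_scores(previous)
        previous = max(scores, key=scores.get)
        if previous == END:
            break
        generated.append(previous)
    return generated


# Toy stand-in for the RNN: scores of the next token given the previous one.
TOY_MODEL = {
    GO: {"hello": 0.9, "world": 0.1, END: 0.0},
    "hello": {"hello": 0.1, "world": 0.8, END: 0.1},
    "world": {"hello": 0.1, "world": 0.1, END: 0.8},
}

pairs = teacher_forcing_pairs(["hello", "world"])
decoded = greedy_decode(TOY_MODEL.__getitem__, max_length=10)
```

During training the inputs come from the ground truth (`pairs`), whereas `greedy_decode` only ever sees the model's own previous output.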

decoder:
    type: generator
    cell_type: gru
    num_layers: 1
    reduce_input: sum
    fc_layers: null
    num_fc_layers: 0
    fc_output_size: 256
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm: null
    fc_norm_params: null
    fc_activation: relu
    fc_dropout: 0.0
    teacher_forcing_decay: none
    teacher_forcing_decay_rate: 0.01
    beam_width: 1
    beam_length_penalty: 1.0

Parameters:

cell_type (default: gru): Type of recurrent cell to use. Options: rnn, lstm, gru.
num_layers (default: 1): The number of stacked recurrent layers.
reduce_input (default: sum): How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Options: sum, mean, avg, max, concat, last.
fc_layers (default: null):
num_fc_layers (default: 0):
fc_output_size (default: 256):
fc_use_bias (default: true):
fc_weights_initializer (default: xavier_uniform):
fc_bias_initializer (default: zeros):
fc_norm (default: null):
fc_norm_params (default: null):
fc_activation (default: relu):
fc_dropout (default: 0.0):
teacher_forcing_decay (default: none): Decay schedule for teacher forcing probability during training. none always uses full teacher forcing; linear linearly decays the probability to zero; exponential applies exponential decay. Implements scheduled sampling (Bengio et al., NeurIPS 2015). Options: none, linear, exponential.
teacher_forcing_decay_rate (default: 0.01): Rate of decay for the teacher forcing probability per decoding step when teacher_forcing_decay is linear or exponential.
beam_width (default: 1): Width of the beam for beam search decoding. 1 = greedy decoding (default). Values > 1 enable beam search at inference time, keeping the top beam_width candidate sequences at each step.
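beam_length_penalty (default: 1.0): Length penalty exponent applied to beam search scores. Score = log_prob / (length ^ beam_length_penalty). Because log-probabilities are negative, dividing by a larger power of the length moves the score of a long sequence closer to zero, so values > 1 favour longer sequences and values < 1 favour shorter ones. Only used when beam_width > 1.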
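
The beam search scoring formula can be evaluated by hand to see how the exponent changes the ranking of candidates. The sketch below applies Score = log_prob / (length ^ beam_length_penalty) to two made-up candidates; the candidate values and helper names are assumptions for illustration only.

```python
def beam_score(log_prob, length, length_penalty):
    """Score = log_prob / (length ^ length_penalty)."""
    return log_prob / (length ** length_penalty)


# Hypothetical candidates: (summed log-probability, number of tokens).
CANDIDATES = {"short": (-2.0, 2), "long": (-3.5, 5)}


def best_candidate(length_penalty):
    """Name of the candidate with the highest score for this exponent."""
    return max(
        CANDIDATES,
        key=lambda name: beam_score(*CANDIDATES[name], length_penalty),
    )


winners = {penalty: best_candidate(penalty) for penalty in (0.0, 1.0, 2.0)}
```

In this toy case the short candidate ranks first with an exponent of 0, and the long candidate ranks first with exponents of 1 and 2.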

Tagger¶

graph LR
  A["emb[0]\n....\nemb[n]"] --> B["Fully\n Connected\n Layers"];
  B --> C["Projection\n....\nProjection"];
  C --> D["Softmax\n....\nSoftmax"];
  subgraph DEC["DECODER.."]
  B
  C
  D
  end
  subgraph COM["COMBINER OUT.."]
  A
  end

In the case of tagger the decoder is a (potentially empty) stack of fully connected layers, followed by a projection into a tensor of size b x s x c, where b is the batch size, s is the length of the sequence and c is the number of classes, followed by a softmax_cross_entropy. This decoder requires its input to be shaped as b x s x h, where h is a hidden dimension, which is the output of a sequence, text or time series input feature without reduced outputs or the output of a sequence-based combiner. If a b x h input is provided instead, an error will be raised during model building.
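
For a single example (batch dimension omitted), the final step of the tagger amounts to one softmax per token over the c classes, followed by picking the most likely class. The sketch below assumes the projection has already produced an s x c matrix of logits; the label names and logit values are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tag_sequence(token_logits, labels):
    """Return the most likely label for every token in the sequence."""
    tags = []
    for logits in token_logits:
        probs = softmax(logits)
        tags.append(labels[probs.index(max(probs))])
    return tags


# s = 3 tokens, c = 3 classes.
labels = ["O", "PER", "LOC"]
token_logits = [[2.0, 0.1, 0.1], [0.2, 3.0, 0.5], [0.1, 0.2, 1.5]]
tags = tag_sequence(token_logits, labels)
```

The output has one tag per input token, which is why the tagger needs the sequence dimension to be preserved in its input.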

decoder:
    type: tagger
    attention_embedding_size: 256
    use_attention: false
    use_bias: true
    attention_num_heads: 8
    fc_layers: null
    num_fc_layers: 0
    fc_output_size: 256
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm: null
    fc_norm_params: null
    fc_activation: relu
    fc_dropout: 0.0

Parameters:

attention_embedding_size (default: 256): The embedding size of the multi-head self attention layer.
use_attention (default: false): Whether to apply a multi-head self attention layer before prediction.
use_bias (default: true): Whether the layer uses a bias vector.
attention_num_heads (default: 8): The number of attention heads in the multi-head self attention layer.
fc_layers (default: null):
num_fc_layers (default: 0):
fc_output_size (default: 256):
fc_use_bias (default: true):
fc_weights_initializer (default: xavier_uniform):
fc_bias_initializer (default: zeros):
fc_norm (default: null):
fc_norm_params (default: null):
fc_activation (default: relu):
fc_dropout (default: 0.0):

Loss¶

Sequence Softmax Cross Entropy¶

loss:
    type: sequence_softmax_cross_entropy
    class_weights: null
    weight: 1.0
    robust_lambda: 0
    confidence_penalty: 0
    class_similarities: null
    class_similarities_temperature: 0
    unique: false

Parameters:

class_weights (default: null): Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied with the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the <UNK> class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like {class_a: 0.5, class_b: 0.7, ...}.
weight (default: 1.0): Weight of the loss.
robust_lambda (default: 0): Replaces the loss with (1 - robust_lambda) * loss + robust_lambda / c where c is the number of classes. Useful in case of noisy labels.
confidence_penalty (default: 0): Penalizes overconfident (low-entropy) predictions by adding an a * (max_entropy - entropy) / max_entropy term to the loss, where a is the value of this parameter. Useful in case of noisy labels.
class_similarities (default: null): If not null, a c x c matrix in the form of a list of lists that contains the mutual similarity of classes. It is used if class_similarities_temperature is greater than 0. The ordering of the rows and columns follows the category to integer ID mapping in the JSON metadata file (the <UNK> class needs to be included too).
class_similarities_temperature (default: 0): The temperature parameter of the softmax that is performed on each row of class_similarities. The output of that softmax is used to determine the supervision vector to provide instead of the one hot vector that would be provided otherwise for each datapoint. The intuition behind it is that errors between similar classes are more tolerable than errors between really different classes.
unique (default: false): If true, the loss is only computed for unique elements in the sequence.
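
The robust_lambda and confidence_penalty formulas given above can be worked through on a single prediction. This is a plain-Python sketch of the two expressions as written in the parameter descriptions, not the library's loss code; the function names and probability values are illustrative assumptions.

```python
import math


def robust_loss(loss, robust_lambda, num_classes):
    """(1 - robust_lambda) * loss + robust_lambda / c."""
    return (1 - robust_lambda) * loss + robust_lambda / num_classes


def confidence_penalty_term(probs, a):
    """a * (max_entropy - entropy) / max_entropy for one predicted distribution."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))
    return a * (max_entropy - entropy) / max_entropy


# Cross entropy of a prediction that puts 0.7 on the true class (index 0).
probs = [0.7, 0.2, 0.1]
loss = -math.log(probs[0])
smoothed = robust_loss(loss, robust_lambda=0.1, num_classes=3)

# A uniform prediction has maximum entropy; a peaked one has low entropy.
penalty_uniform = confidence_penalty_term([1 / 3, 1 / 3, 1 / 3], a=0.5)
penalty_peaked = confidence_penalty_term([0.98, 0.01, 0.01], a=0.5)
```

The penalty is zero for the uniform prediction and grows toward a as the prediction becomes more peaked.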

Metrics¶

The metrics available for text features are the same as for Sequence Features:

sequence_accuracy The rate at which the model predicted the correct sequence.
token_accuracy The number of tokens correctly predicted divided by the total number of tokens in all sequences.
last_accuracy Accuracy considering only the last element of the sequence. Useful to ensure special end-of-sequence tokens are generated or tagged.
edit_distance Levenshtein distance: the minimum number of single-token edits (insertions, deletions or substitutions) required to change the predicted sequence into the ground truth.
perplexity Perplexity is the inverse of the predicted probability of the ground truth sequence, normalized by the number of tokens. The lower the perplexity, the higher the probability of predicting the true sequence.
loss The value of the loss function.
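
The definitions above can be checked on a tiny example. The sketch below computes the metrics in plain Python under simplifying assumptions (predictions and targets already tokenized, equal lengths for the accuracy metrics, per-token probabilities of the ground truth given for perplexity); it is not how Ludwig computes them internally, and the function names are illustrative.

```python
import math


def token_accuracy(predictions, targets):
    """Correctly predicted tokens divided by all tokens in all sequences."""
    correct = total = 0
    for pred, gold in zip(predictions, targets):
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total


def sequence_accuracy(predictions, targets):
    """Fraction of sequences predicted exactly."""
    return sum(p == g for p, g in zip(predictions, targets)) / len(targets)


def edit_distance(pred, gold):
    """Levenshtein distance over tokens, computed row by row."""
    previous = list(range(len(gold) + 1))
    for i, p in enumerate(pred, start=1):
        current = [i]
        for j, g in enumerate(gold, start=1):
            cost = 0 if p == g else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def perplexity(token_probs):
    """exp of the mean negative log-probability of the ground truth tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


preds = [["a", "b", "c"], ["a", "x", "c"]]
golds = [["a", "b", "c"], ["a", "b", "c"]]
tok_acc = token_accuracy(preds, golds)
seq_acc = sequence_accuracy(preds, golds)
distance = edit_distance(preds[1], golds[1])
ppl = perplexity([0.5, 0.5, 0.5])
```

A single wrong token lowers token_accuracy only slightly but counts the whole sequence as wrong for sequence_accuracy.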

You can set any of the above as validation_metric in the trainer section of the configuration if validation_field names a text feature.