Skip to content

⇅ Sequence Features

Preprocessing

Sequence features are transformed into an integer valued matrix of size n x l (where n is the number of rows and l is the minimum of the length of the longest sequence and a max_sequence_length parameter) and added to HDF5 with a key that reflects the name of column in the dataset. Each sequence in mapped to a list of integers internally. First, a tokenizer converts each sequence to a list of tokens (default tokenization is done by splitting on spaces). Next, a dictionary is constructed which maps each unique token to its frequency in the dataset column. Tokens are ranked by frequency and a sequential integer ID is assigned from the most frequent to the most rare. Ludwig uses <PAD>, <UNK>, <SOS>, and <EOS> special symbols for padding, unknown, start, and end, consistent with common NLP deep learning practices. Special symbols can also be set manually in the preprocessing config. The column name is added to the JSON file, with an associated dictionary containing

  1. the mapping from integer to string (idx2str)
  2. the mapping from string to id (str2idx)
  3. the mapping from string to frequency (str2freq)
  4. the maximum length of all sequences (max_sequence_length)
  5. additional preprocessing information (how to fill missing values and what token to use to fill missing values)
preprocessing:
    tokenizer: space
    sequence_length: null
    max_sequence_length: 256
    missing_value_strategy: fill_with_const
    most_common: 20000
    lowercase: false
    fill_value: <UNK>
    ngram_size: 2
    padding_symbol: <PAD>
    unknown_symbol: <UNK>
    padding: right
    cache_encoder_embeddings: false
    vocab_file: null

Parameters:

  • tokenizer (default: space) : Defines how to map from the raw string content of the dataset column to a sequence of elements.
  • sequence_length (default: null) : The desired length (number of tokens) of the sequence. Sequences that are longer than this value will be truncated and sequences shorter than this value will be padded. If None, sequence length will be inferred from the training dataset.
  • max_sequence_length (default: 256) : The maximum length (number of tokens) of the sequence. Sequences that are longer than this value will be truncated. Useful as a stopgap measure if sequence_length is set to None. If None, max sequence length will be inferred from the training dataset.
  • missing_value_strategy (default: fill_with_const) : What strategy to follow when there's a missing value in a text column Options: fill_with_const, fill_with_mode, bfill, ffill, drop_row. See Missing Value Strategy for details.
  • most_common (default: 20000): The maximum number of most common tokens in the vocabulary. If the data contains more than this amount, the most infrequent symbols will be treated as unknown.
  • lowercase (default: false): If true, converts the string to lowercase before tokenizing. Options: true, false.
  • fill_value (default: <UNK>): The value to replace missing values with in case the missing_value_strategy is fill_with_const
  • ngram_size (default: 2): The size of the ngram when using the ngram tokenizer (e.g, 2 = bigram, 3 = trigram, etc.).
  • padding_symbol (default: <PAD>): The string used as a padding symbol. This special token is mapped to the integer ID 0 in the vocabulary.
  • unknown_symbol (default: <UNK>): The string used as an unknown placeholder. This special token is mapped to the integer ID 1 in the vocabulary.
  • padding (default: right): The direction of the padding. Options: left, right.
  • cache_encoder_embeddings (default: false): Compute encoder embeddings in preprocessing, speeding up training time considerably. Options: true, false.
  • vocab_file (default: null): Filepath string to a UTF-8 encoded file containing the sequence's vocabulary. On each line the first string until or is considered a word.

Preprocessing parameters can also be defined once and applied to all sequence input features using the Type-Global Preprocessing section.

Input Features

Sequence features have several encoders and each of them has its own parameters. Inputs are of size b while outputs are of size b x h where b is the batch size and h is the dimensionality of the output of the encoder. In case a representation for each element of the sequence is needed (for example for tagging them, or for using an attention mechanism), one can specify the parameter reduce_output to be null and the output will be a b x s x h tensor where s is the length of the sequence. Some encoders, because of their inner workings, may require additional parameters to be specified in order to obtain one representation for each element of the sequence. For instance the parallel_cnn encoder by default pools and flattens the sequence dimension and then passes the flattened vector through fully connected layers, so in order to obtain the full sequence tensor one has to specify reduce_output: null.

The encoder parameters specified at the feature level are:

  • tied (default null): name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.

Example sequence feature entry in the input features list:

name: sequence_column_name
type: sequence
tied: null
encoder: 
    type: stacked_cnn

The available encoder parameters:

  • type (default parallel_cnn): the name of the encoder to use to encode the sequence, one of embed, parallel_cnn, stacked_cnn, stacked_parallel_cnn, rnn, cnnrnn, transformer and passthrough (equivalent to null or None).

Encoder type and encoder parameters can also be defined once and applied to all sequence input features using the Type-Global Encoder section.

Encoders

Embed Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C["Aggregation\n Reduce\n Operation"];
  C --> ...;

The embed encoder simply maps each token in the input sequence to an embedding, creating a b x s x h tensor where b is the batch size, s is the length of the sequence and h is the embedding size. The tensor is reduced along the s dimension to obtain a single vector of size h for each element of the batch. If you want to output the full b x s x h tensor, you can specify reduce_output: null.

encoder:
    type: embed
    dropout: 0.0
    embedding_size: 256
    representation: dense
    weights_initializer: uniform
    reduce_output: sum
    embeddings_on_cpu: false
    embeddings_trainable: true
    pretrained_embeddings: null

Parameters:

  • dropout (default: 0.0) : Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
  • representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
  • weights_initializer (default: uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.

  • embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
  • embeddings_trainable (default: true): If true embeddings are trained during the training process, if false embeddings are fixed. It may be useful when loading pretrained embeddings for avoiding finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.

  • pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows to specify a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embedding plus some random noise to make them different from each other. This parameter has effect only if representation is dense.

Parallel CNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  C --> D1["1D Conv\n Width 2"] --> E1["Pool"];
  C --> D2["1D Conv\n Width 3"] --> E2["Pool"];
  C --> D3["1D Conv\n Width 4"] --> E3["Pool"];
  C --> D4["1D Conv\n Width 5"] --> E4["Pool"];
  E1 --> F["Concat"] --> G["Fully\n Connected\n Layers"] --> H["..."];
  E2 --> F;
  E3 --> F;
  E4 --> F;

The parallel cnn encoder is inspired by Yoon Kim's Convolutional Neural Network for Sentence Classification. It works by first mapping the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then it passes the embedding through a number of parallel 1d convolutional layers with different filter size (by default 4 layers with filter size 2, 3, 4 and 5), followed by max pooling and concatenation. This single vector concatenating the outputs of the parallel convolutional layers is then passed through a stack of fully connected layers and returned as a b x h tensor where h is the output size of the last fully connected layer. If you want to output the full b x s x h tensor, you can specify reduce_output: null.

encoder:
    type: parallel_cnn
    dropout: 0.0
    embedding_size: 256
    num_conv_layers: null
    output_size: 256
    activation: relu
    filter_size: 3
    norm: null
    representation: dense
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: sum
    conv_layers: null
    pool_function: max
    pool_size: null
    norm_params: null
    num_fc_layers: null
    fc_layers: null
    pretrained_embeddings: null
    num_filters: 256

Parameters:

  • dropout (default: 0.0) : Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
  • num_conv_layers (default: null) : The number of stacked convolutional layers when conv_layers is null.
  • output_size (default: 256) : The default output_size that will be used for each layer.
  • activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
  • norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
  • representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
  • use_bias (default: true): Whether to use a bias vector. Options: true, false.
  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.

  • embeddings_trainable (default: true): If true embeddings are trained during the training process, if false embeddings are fixed. It may be useful when loading pretrained embeddings for avoiding finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
  • reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
  • conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both conv_layers and num_conv_layers are null, a default list will be assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].

  • pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value Options: last, sum, mean, avg, max, concat, attention, none, None, null.

  • pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
  • norm_params (default: null): Parameters used if norm is either batch or layer.
  • num_fc_layers (default: null): Number of parallel fully connected layers to use.
  • fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.

  • pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows to specify a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embedding plus some random noise to make them different from each other. This parameter has effect only if representation is dense.

  • num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.

Stacked CNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C["1D Conv Layers\n Different Widths"];
  C --> D["Fully\n Connected\n Layers"];
  D --> ...;

The stacked cnn encoder is inspired by Xiang Zhang at all's Character-level Convolutional Networks for Text Classification. It works by first mapping the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then it passes the embedding through a stack of 1d convolutional layers with different filter size (by default 6 layers with filter size 7, 7, 3, 3, 3 and 3), followed by an optional final pool and by a flatten operation. This single flatten vector is then passed through a stack of fully connected layers and returned as a b x h tensor where h is the output size of the last fully connected layer. If you want to output the full b x s x h tensor, you can specify the pool_size of all your conv_layers to be null and reduce_output: null, while if pool_size has a value different from null and reduce_output: null the returned tensor will be of shape b x s' x h, where s' is width of the output of the last convolutional layer.

encoder:
    type: stacked_cnn
    dropout: 0.0
    num_conv_layers: null
    embedding_size: 256
    output_size: 256
    activation: relu
    filter_size: 3
    strides: 1
    norm: null
    representation: dense
    conv_layers: null
    pool_function: max
    pool_size: null
    dilation_rate: 1
    pool_strides: null
    pool_padding: same
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: sum
    norm_params: null
    num_fc_layers: null
    fc_layers: null
    num_filters: 256
    padding: same
    pretrained_embeddings: null

Parameters:

  • dropout (default: 0.0) : Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • num_conv_layers (default: null) : The number of stacked convolutional layers when conv_layers is null.
  • embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
  • output_size (default: 256) : The default output_size that will be used for each layer.
  • activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
  • strides (default: 1): Stride length of the convolution.
  • norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
  • representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
  • conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both conv_layers and num_conv_layers are null, a default list will be assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].

  • pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value Options: last, sum, mean, avg, max, concat, attention, none, None, null.

  • pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
  • dilation_rate (default: 1): Dilation rate to use for dilated convolution.
  • pool_strides (default: null): Factor to scale down.
  • pool_padding (default: same): Padding to use. Options: valid, same.
  • use_bias (default: true): Whether to use a bias vector. Options: true, false.
  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.

  • embeddings_trainable (default: true): If true embeddings are trained during the training process, if false embeddings are fixed. It may be useful when loading pretrained embeddings for avoiding finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
  • reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
  • norm_params (default: null): Parameters used if norm is either batch or layer.
  • num_fc_layers (default: null): Number of parallel fully connected layers to use.
  • fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.

  • num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.

  • padding (default: same): Padding to use. Options: valid, same.

  • pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows to specify a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embedding plus some random noise to make them different from each other. This parameter has effect only if representation is dense.

Stacked Parallel CNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  C --> D1["1D Conv\n Width 2"] --> E["Concat"];
  C --> D2["1D Conv\n Width 3"] --> E;
  C --> D3["1D Conv\n Width 4"] --> E;
  C --> D4["1D Conv\n Width 5"] --> E;
  E --> F["..."];
  F --> G1["1D Conv\n Width 2"] --> H["Concat"];
  F --> G2["1D Conv\n Width 3"] --> H;
  F --> G3["1D Conv\n Width 4"] --> H;
  F --> G4["1D Conv\n Width 5"] --> H;
  H --> I["Pool"] --> J["Fully\n Connected\n Layers"] --> K["..."];

The stacked parallel cnn encoder is a combination of the Parallel CNN and the Stacked CNN encoders where each layer of the stack is composed of parallel convolutional layers. It works by first mapping the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then it passes the embedding through a stack of several parallel 1d convolutional layers with different filter size, followed by an optional final pool and by a flatten operation. This single flattened vector is then passed through a stack of fully connected layers and returned as a b x h tensor where h is the output size of the last fully connected layer. If you want to output the full b x s x h tensor, you can specify reduce_output: null.

encoder:
    type: stacked_parallel_cnn
    dropout: 0.0
    embedding_size: 256
    output_size: 256
    activation: relu
    filter_size: 3
    norm: null
    representation: dense
    num_stacked_layers: null
    pool_function: max
    pool_size: null
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: sum
    norm_params: null
    num_fc_layers: null
    fc_layers: null
    stacked_layers: null
    num_filters: 256
    pretrained_embeddings: null

Parameters:

  • dropout (default: 0.0) : Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
  • output_size (default: 256) : The default output_size that will be used for each layer.
  • activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
  • norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
  • representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
  • num_stacked_layers (default: null): If stacked_layers is null, this is the number of elements in the stack of parallel convolutional layers.
  • pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value Options: last, sum, mean, avg, max, concat, attention, none, None, null.
  • pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
  • use_bias (default: true): Whether to use a bias vector. Options: true, false.
  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.

  • embeddings_trainable (default: true): If true embeddings are trained during the training process, if false embeddings are fixed. It may be useful when loading pretrained embeddings for avoiding finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
  • reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
  • norm_params (default: null): Parameters used if norm is either batch or layer.
  • num_fc_layers (default: null): Number of parallel fully connected layers to use.
  • fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.

  • stacked_layers (default: null): a nested list of lists of dictionaries containing the parameters of the stack of parallel convolutional layers. The length of the list determines the number of stacked parallel convolutional layers, length of the sub-lists determines the number of parallel conv layers and the content of each dictionary determines the parameters for a specific layer.

  • num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.

  • pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows to specify a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embedding plus some random noise to make them different from each other. This parameter has effect only if representation is dense.

RNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C["RNN Layers"];
  C --> D["Fully\n Connected\n Layers"];
  D --> ...;

The rnn encoder works by first mapping the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then it passes the embedding through a stack of recurrent layers (by default 1 layer), followed by a reduce operation that by default only returns the last output, but can perform other reduce functions. If you want to output the full b x s x h where h is the size of the output of the last rnn layer, you can specify reduce_output: null.

encoder:
    type: rnn
    dropout: 0.0
    cell_type: rnn
    num_layers: 1
    state_size: 256
    embedding_size: 256
    output_size: 256
    norm: null
    num_fc_layers: 0
    fc_dropout: 0.0
    recurrent_dropout: 0.0
    activation: tanh
    fc_activation: relu
    recurrent_activation: sigmoid
    representation: dense
    unit_forget_bias: true
    recurrent_initializer: orthogonal
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: last
    norm_params: null
    fc_layers: null
    bidirectional: false
    pretrained_embeddings: null

Parameters:

  • dropout (default: 0.0) : Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • cell_type (default: rnn) : The type of recurrent cell to use. Available values are: rnn, lstm, gru. For reference about the differences between the cells please refer to torch.nn Recurrent Layers. Options: rnn, lstm, gru.
  • num_layers (default: 1) : The number of stacked recurrent layers.
  • state_size (default: 256) : The size of the state of the rnn.
  • embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
  • output_size (default: 256) : The default output_size that will be used for each layer.
  • norm (default: null) : The default norm that will be used for each layer. Options: batch, layer, ghost, null.
  • num_fc_layers (default: 0) : Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
  • fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • recurrent_dropout (default: 0.0): The dropout rate for the recurrent state
  • activation (default: tanh): The default activation function. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • recurrent_activation (default: sigmoid): The activation function to use in the recurrent step Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
  • unit_forget_bias (default: true): If true, add 1 to the bias of the forget gate at initialization Options: true, false.
  • recurrent_initializer (default: orthogonal): The initializer for recurrent matrix weights Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
  • use_bias (default: true): Whether to use a bias vector. Options: true, false.
  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.

  • embeddings_trainable (default: true): If true embeddings are trained during the training process, if false embeddings are fixed. It may be useful when loading pretrained embeddings for avoiding finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
  • reduce_output (default: last): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
  • norm_params (default: null): Default parameters passed to the norm module.
  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.

  • bidirectional (default: false): If true, two recurrent networks will perform encoding in the forward and backward direction and their outputs will be concatenated. Options: true, false.

  • pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows to specify a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embedding plus some random noise to make them different from each other. This parameter has effect only if representation is dense.

CNN RNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C1["CNN Layers"];
  C1 --> C2["RNN Layers"];
  C2 --> D["Fully\n Connected\n Layers"];
  D --> ...;

The cnnrnn encoder works by first mapping the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then it passes the embedding through a stack of convolutional layers (by default 2), that is followed by a stack of recurrent layers (by default 1), followed by a reduce operation that by default only returns the last output, but can perform other reduce functions. If you want to output the full b x s x h where h is the size of the output of the last rnn layer, you can specify reduce_output: null.

encoder:
    type: cnnrnn
    dropout: 0.0
    conv_dropout: 0.0
    cell_type: rnn
    num_conv_layers: null
    state_size: 256
    embedding_size: 256
    output_size: 256
    norm: null
    num_fc_layers: 0
    fc_dropout: 0.0
    recurrent_dropout: 0.0
    activation: tanh
    filter_size: 5
    strides: 1
    fc_activation: relu
    recurrent_activation: sigmoid
    conv_activation: relu
    representation: dense
    conv_layers: null
    pool_function: max
    pool_size: null
    dilation_rate: 1
    pool_strides: null
    pool_padding: same
    unit_forget_bias: true
    recurrent_initializer: orthogonal
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: last
    norm_params: null
    fc_layers: null
    num_filters: 256
    padding: same
    num_rec_layers: 1
    bidirectional: false
    pretrained_embeddings: null

Parameters:

  • dropout (default: 0.0) : Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • conv_dropout (default: 0.0) : The dropout rate for the convolutional layers
  • cell_type (default: rnn) : The type of recurrent cell to use. Available values are: rnn, lstm, gru. For reference about the differences between the cells please refer to torch.nn Recurrent Layers. Options: rnn, lstm, gru.
  • num_conv_layers (default: null) : The number of stacked convolutional layers when conv_layers is null.
  • state_size (default: 256) : The size of the state of the rnn.
  • embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
  • output_size (default: 256) : The default output_size that will be used for each layer.
  • norm (default: null) : The default norm that will be used for each layer. Options: batch, layer, ghost, null.
  • num_fc_layers (default: 0) : Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
  • fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • recurrent_dropout (default: 0.0): The dropout rate for the recurrent state
  • activation (default: tanh): The default activation function to use. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • filter_size (default: 5): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
  • strides (default: 1): Stride length of the convolution.
  • fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • recurrent_activation (default: sigmoid): The activation function to use in the recurrent step Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • conv_activation (default: relu): The default activation function that will be used for each convolutional layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
  • conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both conv_layers and num_conv_layers are null, a default list will be assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].

  • pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value Options: last, sum, mean, avg, max, concat, attention, none, None, null.

  • pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
  • dilation_rate (default: 1): Dilation rate to use for dilated convolution.
  • pool_strides (default: null): Factor to scale down.
  • pool_padding (default: same): Padding to use. Options: valid, same.
  • unit_forget_bias (default: true): If true, add 1 to the bias of the forget gate at initialization Options: true, false.
  • recurrent_initializer (default: orthogonal): The initializer for recurrent matrix weights Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
  • use_bias (default: true): Whether to use a bias vector. Options: true, false.
  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.

  • embeddings_trainable (default: true): If true embeddings are trained during the training process, if false embeddings are fixed. It may be useful when loading pretrained embeddings for avoiding finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
  • reduce_output (default: last): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
  • norm_params (default: null): Default parameters passed to the norm module.
  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.

  • num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.

  • padding (default: same): Padding to use. Options: valid, same.
  • num_rec_layers (default: 1): The number of stacked recurrent layers.
  • bidirectional (default: false): If true, two recurrent networks will perform encoding in the forward and backward direction and their outputs will be concatenated. Options: true, false.

  • pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows to specify a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embedding plus some random noise to make them different from each other. This parameter has effect only if representation is dense.

Transformer Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C["Transformer\n Blocks"];
  C --> D["Fully\n Connected\n Layers"];
  D --> ...;

The transformer encoder implements a stack of transformer blocks, replicating the architecture introduced in the Attention is all you need paper, and adds am optional stack of fully connected layers at the end.

encoder:
    type: transformer
    dropout: 0.1
    num_layers: 1
    embedding_size: 256
    output_size: 256
    norm: null
    num_fc_layers: 0
    fc_dropout: 0.0
    hidden_size: 256
    transformer_output_size: 256
    fc_activation: relu
    representation: dense
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: last
    norm_params: null
    fc_layers: null
    num_heads: 8
    pretrained_embeddings: null

Parameters:

  • dropout (default: 0.1) : The dropout rate for the transformer block. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • num_layers (default: 1) : The number of transformer layers.
  • embedding_size (default: 256) : The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
  • output_size (default: 256) : The default output_size that will be used for each layer.
  • norm (default: null) : The default norm that will be used for each layer. Options: batch, layer, ghost, null.
  • num_fc_layers (default: 0) : Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
  • fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • hidden_size (default: 256): The size of the hidden representation within the transformer block. It is usually the same as the embedding_size, but if the two values are different, a projection layer will be added before the first transformer block.
  • transformer_output_size (default: 256): Size of the fully connected layer after self attention in the transformer block. This is usually the same as hidden_size and embedding_size.
  • fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
  • use_bias (default: true): Whether to use a bias vector. Options: true, false.
  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.

  • embeddings_trainable (default: true): If true embeddings are trained during the training process, if false embeddings are fixed. It may be useful when loading pretrained embeddings for avoiding finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
  • reduce_output (default: last): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
  • norm_params (default: null): Default parameters passed to the norm module.
  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.

  • num_heads (default: 8): Number of attention heads in each transformer block.

  • pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows to specify a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embedding plus some random noise to make them different from each other. This parameter has effect only if representation is dense.

Output Features

Sequence output features can be used for either tagging (classifying each element of an input sequence) or generation (generating a sequence by sampling from the model). Ludwig provides two sequence decoders named tagger and generator respectively.

Example sequence output feature using default parameters:

name: seq_column_name
type: sequence
reduce_input: null
dependencies: []
reduce_dependencies: sum
loss:
    type: softmax_cross_entropy
    confidence_penalty: 0
    robust_lambda: 0
    class_weights: 1
    class_similarities_temperature: 0
decoder: 
    type: generator

Parameters:

  • reduce_input (default sum): defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: sum, mean or avg, max, concat (concatenates along the sequence dimension), last (returns the last vector of the sequence dimension).
  • dependencies (default []): the output features this one is dependent on. For a detailed explanation refer to Output Feature Dependencies.
  • reduce_dependencies (default sum): defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: sum, mean or avg, max, concat (concatenates along the sequence dimension), last (returns the last vector of the sequence dimension).
  • loss (default {type: softmax_cross_entropy, class_similarities_temperature: 0, class_weights: 1, confidence_penalty: 0, robust_lambda: 0}): is a dictionary containing a loss type. The only available loss type for sequences is softmax_cross_entropy. See Loss for details.
  • decoder (default: {"type": "generator"}): Decoder for the desired task. Options: generator, tagger. See Decoder for details.

Decoder type and decoder parameters can also be defined once and applied to all sequence output features using the Type-Global Decoder section. Loss and loss related parameters can also be defined once in the same way.

Decoders

Generator

graph LR
  A["Combiner Output"] --> B["Fully\n Connected\n Layers"];
  B --> C1["RNN"] --> C2["RNN"] --> C3["RNN"];
  GO(["GO"]) -.-o C1;
  C1 -.-o O1("Output");
  O1 -.-o C2;
  C2 -.-o O2("Output");
  O2 -.-o C3;
  C3 -.-o END(["END"]);
  subgraph DEC["DECODER.."]
  B
  C1
  C2
  C3
  end

In the case of generator the decoder is a (potentially empty) stack of fully connected layers, followed by an rnn that generates outputs feeding on its own previous predictions and generates a tensor of size b x s' x c, where b is the batch size, s' is the length of the generated sequence and c is the number of classes, followed by a softmax_cross_entropy. During training teacher forcing is adopted, meaning the list of targets is provided as both inputs and outputs (shifted by 1), while at evaluation time greedy decoding (generating one token at a time and feeding it as input for the next step) is performed by beam search, using a beam of 1 by default. In general a generator expects a b x h shaped input tensor, where h is a hidden dimension. The h vectors are (after an optional stack of fully connected layers) fed into the rnn generator. One exception is when the generator uses attention, as in that case the expected size of the input tensor is b x s x h, which is the output of a sequence, text or time series input feature without reduced outputs or the output of a sequence-based combiner. If a b x h input is provided to a generator decoder using an rnn with attention instead, an error will be raised during model building.

decoder:
    type: generator
    num_fc_layers: 0
    fc_output_size: 256
    fc_norm: null
    fc_dropout: 0.0
    cell_type: gru
    num_layers: 1
    fc_activation: relu
    reduce_input: sum
    fc_layers: null
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm_params: null

Parameters:

  • num_fc_layers (default: 0) : Number of fully-connected layers if fc_layers not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
  • fc_output_size (default: 256) : Output size of fully connected stack.
  • fc_norm (default: null) : Default normalization applied at the beginnging of fully connected layers. Options: batch, layer, ghost, null.
  • fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • cell_type (default: gru) : Type of recurrent cell to use. Options: rnn, lstm, gru.
  • num_layers (default: 1) : The number of stacked recurrent layers.
  • fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • reduce_input (default: sum): How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension) Options: sum, mean, avg, max, concat, last.
  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
  • fc_use_bias (default: true): Whether the layer uses a bias vector in the fc_stack. Options: true, false.
  • fc_weights_initializer (default: xavier_uniform): The weights initializer to use for the layers in the fc_stack
  • fc_bias_initializer (default: zeros): The bias initializer to use for the layers in the fc_stack
  • fc_norm_params (default: null): Default parameters passed to the norm module.

Tagger

graph LR
  A["emb[0]\n....\nemb[n]"] --> B["Fully\n Connected\n Layers"];
  B --> C["Projection\n....\nProjection"];
  C --> D["Softmax\n....\nSoftmax"];
  subgraph DEC["DECODER.."]
  B
  C
  D
  end
  subgraph COM["COMBINER OUT.."]
  A
  end

In the case of tagger the decoder is a (potentially empty) stack of fully connected layers, followed by a projection into a tensor of size b x s x c, where b is the batch size, s is the length of the sequence and c is the number of classes, followed by a softmax_cross_entropy. This decoder requires its input to be shaped as b x s x h, where h is a hidden dimension, which is the output of a sequence, text or time series input feature without reduced outputs or the output of a sequence-based combiner. If a b x h input is provided instead, an error will be raised during model building.

decoder:
    type: tagger
    num_fc_layers: 0
    fc_output_size: 256
    fc_norm: null
    fc_dropout: 0.0
    fc_activation: relu
    attention_embedding_size: 256
    fc_layers: null
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm_params: null
    use_attention: false
    use_bias: true
    attention_num_heads: 8

Parameters:

  • num_fc_layers (default: 0) : Number of fully-connected layers if fc_layers not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
  • fc_output_size (default: 256) : Output size of fully connected stack.
  • fc_norm (default: null) : Default normalization applied at the beginnging of fully connected layers. Options: batch, layer, ghost, null.
  • fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
  • attention_embedding_size (default: 256): The embedding size of the multi-head self attention layer.
  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
  • fc_use_bias (default: true): Whether the layer uses a bias vector in the fc_stack. Options: true, false.
  • fc_weights_initializer (default: xavier_uniform): The weights initializer to use for the layers in the fc_stack
  • fc_bias_initializer (default: zeros): The bias initializer to use for the layers in the fc_stack
  • fc_norm_params (default: null): Default parameters passed to the norm module.
  • use_attention (default: false): Whether to apply a multi-head self attention layer before prediction. Options: true, false.
  • use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
  • attention_num_heads (default: 8): The number of attention heads in the multi-head self attention layer.

Loss

Sequence Softmax Cross Entropy

loss:
    type: sequence_softmax_cross_entropy
    class_weights: null
    weight: 1.0
    robust_lambda: 0
    confidence_penalty: 0
    class_similarities: null
    class_similarities_temperature: 0
    unique: false

Parameters:

  • class_weights (default: null) : Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied to the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the <UNK> class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like {class_a: 0.5, class_b: 0.7, ...}.
  • weight (default: 1.0): Weight of the loss.
  • robust_lambda (default: 0): Replaces the loss with (1 - robust_lambda) * loss + robust_lambda / c where c is the number of classes. Useful in case of noisy labels.
  • confidence_penalty (default: 0): Penalizes overconfident predictions (low entropy) by adding an additional term that penalizes too confident predictions by adding a a * (max_entropy - entropy) / max_entropy term to the loss, where a is the value of this parameter. Useful in case of noisy labels.
  • class_similarities (default: null): If not null it is a c x c matrix in the form of a list of lists that contains the mutual similarity of classes. It is used if class_similarities_temperature is greater than 0. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the <UNK> class needs to be included too).
  • class_similarities_temperature (default: 0): The temperature parameter of the softmax that is performed on each row of class_similarities. The output of that softmax is used to determine the supervision vector to provide instead of the one hot vector that would be provided otherwise for each datapoint. The intuition behind it is that errors between similar classes are more tolerable than errors between really different classes.
  • unique (default: false): If true, the loss is only computed for unique elements in the sequence. Options: true, false.

Metrics

The metrics that are calculated every epoch and are available for sequence features are:

  • sequence_accuracy The rate at which the model predicted the correct sequence.
  • token_accuracy The number of tokens correctly predicted divided by the total number of tokens in all sequences.
  • last_accuracy Accuracy considering only the last element of the sequence. Useful to ensure special end-of-sequence tokens are generated or tagged.
  • edit_distance Levenshtein distance: the minimum number of single-token edits (insertions, deletions or substitutions) required to change predicted sequence to ground truth.
  • perplexity Perplexity is the inverse of the predicted probability of the ground truth sequence, normalized by the number of tokens. The lower the perplexity, the higher the probability of predicting the true sequence.
  • loss The value of the loss function.

You can set any of the above as validation_metric in the training section of the configuration if validation_field names a sequence feature.