# Sequence Features

## Preprocessing

Sequence features are transformed into an integer-valued matrix of size `n x l` (where `n` is the number of rows and `l` is the minimum of the length of the longest sequence and the `max_sequence_length` parameter) and added to HDF5 with a key that reflects the name of the column in the dataset.
Each sequence is mapped to a list of integers internally. First, a tokenizer converts each sequence to a list of tokens (default tokenization is done by splitting on spaces). Next, a dictionary is constructed which maps each unique token to its frequency in the dataset column. Tokens are ranked by frequency and a sequential integer ID is assigned from the most frequent to the most rare. Ludwig uses the `<PAD>`, `<UNK>`, `<SOS>`, and `<EOS>` special symbols for padding, unknown, start, and end, consistent with common NLP deep learning practice. Special symbols can also be set manually in the preprocessing config.
The column name is added to the JSON file, with an associated dictionary containing:

- the mapping from integer to string (`idx2str`)
- the mapping from string to id (`str2idx`)
- the mapping from string to frequency (`str2freq`)
- the maximum length of all sequences (`max_sequence_length`)
- additional preprocessing information (how to fill missing values and what token to use to fill missing values)
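The vocabulary construction described above can be sketched in a few lines of Python. This is an illustration, not Ludwig's actual implementation; the function and variable names are hypothetical, but the resulting mappings mirror the documented `idx2str`, `str2idx`, and `str2freq` dictionaries, with `<PAD>` mapped to 0 and `<UNK>` mapped to 1:

```python
from collections import Counter

def build_sequence_vocab(column, most_common=20000):
    """Tokenize on spaces, count token frequencies, and assign sequential
    integer IDs from most frequent to most rare, after the special symbols."""
    str2freq = Counter()
    for sequence in column:
        str2freq.update(sequence.split(" "))
    specials = ["<PAD>", "<UNK>", "<SOS>", "<EOS>"]
    # keep only the `most_common` highest-frequency tokens
    ranked = [tok for tok, _ in str2freq.most_common(most_common)]
    idx2str = specials + ranked
    str2idx = {tok: i for i, tok in enumerate(idx2str)}
    return idx2str, str2idx, dict(str2freq)

idx2str, str2idx, str2freq = build_sequence_vocab(["a b a", "a c"])
# "a" is the most frequent token, so it receives the first non-special ID
```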

```
preprocessing:
    tokenizer: space
    sequence_length: null
    max_sequence_length: 256
    missing_value_strategy: fill_with_const
    most_common: 20000
    lowercase: false
    fill_value: <UNK>
    ngram_size: 2
    padding_symbol: <PAD>
    unknown_symbol: <UNK>
    padding: right
    cache_encoder_embeddings: false
    vocab_file: null
```

Parameters:

- `tokenizer` (default: `space`): Defines how to map from the raw string content of the dataset column to a sequence of elements.
- `sequence_length` (default: `null`): The desired length (number of tokens) of the sequence. Sequences longer than this value will be truncated and sequences shorter than this value will be padded. If `null`, sequence length will be inferred from the training dataset.
- `max_sequence_length` (default: `256`): The maximum length (number of tokens) of the sequence. Sequences longer than this value will be truncated. Useful as a stopgap measure if `sequence_length` is set to `null`. If `null`, max sequence length will be inferred from the training dataset.
- `missing_value_strategy` (default: `fill_with_const`): What strategy to follow when there's a missing value in the column. Options: `fill_with_const`, `fill_with_mode`, `bfill`, `ffill`, `drop_row`. See Missing Value Strategy for details.
- `most_common` (default: `20000`): The maximum number of most common tokens in the vocabulary. If the data contains more than this number, the most infrequent tokens will be treated as unknown.
- `lowercase` (default: `false`): If true, converts the string to lowercase before tokenizing. Options: `true`, `false`.
- `fill_value` (default: `<UNK>`): The value to replace missing values with when `missing_value_strategy` is `fill_with_const`.
- `ngram_size` (default: `2`): The size of the ngram when using the `ngram` tokenizer (e.g., 2 = bigram, 3 = trigram, etc.).
- `padding_symbol` (default: `<PAD>`): The string used as a padding symbol. This special token is mapped to the integer ID 0 in the vocabulary.
- `unknown_symbol` (default: `<UNK>`): The string used as an unknown placeholder. This special token is mapped to the integer ID 1 in the vocabulary.
- `padding` (default: `right`): The direction of the padding. Options: `left`, `right`.
- `cache_encoder_embeddings` (default: `false`): Compute encoder embeddings in preprocessing, speeding up training time considerably. Options: `true`, `false`.

- `vocab_file` (default: `null`): Filepath string to a UTF-8 encoded file containing the sequence's vocabulary. On each line, the first string until a tab (`\t`) or newline (`\n`) is considered a word.
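To make the interaction between `sequence_length`, `padding`, and `padding_symbol` concrete, here is a minimal sketch (the helper name is hypothetical, not a Ludwig function): sequences longer than the target length are truncated, shorter ones are padded with the `<PAD>` ID (0) on the side given by `padding`.

```python
def pad_sequence(token_ids, length, pad_id=0, padding="right"):
    """Truncate to `length`, then pad with `pad_id` on the chosen side."""
    token_ids = token_ids[:length]            # truncation
    pad = [pad_id] * (length - len(token_ids))
    return token_ids + pad if padding == "right" else pad + token_ids

pad_sequence([5, 9, 2], 5)           # right-padded to length 5
pad_sequence([5, 9, 2, 7, 1, 4], 5)  # truncated to length 5
```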

Preprocessing parameters can also be defined once and applied to all sequence input features using the Type-Global Preprocessing section.

## Input Features

Sequence features have several encoders and each of them has its own parameters.
Inputs are of size `b` while outputs are of size `b x h`, where `b` is the batch size and `h` is the dimensionality of the output of the encoder.
In case a representation for each element of the sequence is needed (for example for tagging them, or for using an attention mechanism), one can specify the parameter `reduce_output` to be `null` and the output will be a `b x s x h` tensor, where `s` is the length of the sequence.
Some encoders, because of their inner workings, may require additional parameters to be specified in order to obtain one representation for each element of the sequence.
For instance, the `parallel_cnn` encoder by default pools and flattens the sequence dimension and then passes the flattened vector through fully connected layers, so in order to obtain the full sequence tensor one has to specify `reduce_output: null`.

The encoder parameters specified at the feature level are:

- `tied` (default: `null`): name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.

Example sequence feature entry in the input features list:

```
name: sequence_column_name
type: sequence
tied: null
encoder: 
    type: stacked_cnn
```

The available encoder parameters:

- `type` (default: `parallel_cnn`): the name of the encoder to use to encode the sequence, one of `embed`, `parallel_cnn`, `stacked_cnn`, `stacked_parallel_cnn`, `rnn`, `cnnrnn`, `transformer`, and `passthrough` (equivalent to `null` or `None`).

Encoder type and encoder parameters can also be defined once and applied to all sequence input features using the Type-Global Encoder section.

### Encoders

#### Embed Encoder

```
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["Aggregation\n Reduce\n Operation"];
C --> ...;
```

The embed encoder simply maps each token in the input sequence to an embedding, creating a `b x s x h` tensor where `b` is the batch size, `s` is the length of the sequence, and `h` is the embedding size.
The tensor is reduced along the `s` dimension to obtain a single vector of size `h` for each element of the batch.
If you want to output the full `b x s x h` tensor, you can specify `reduce_output: null`.
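The lookup-then-reduce behavior can be sketched with NumPy (the shapes and values here are illustrative; the actual encoder is a trainable PyTorch module):

```python
import numpy as np

# Hypothetical shapes: batch b=2, sequence length s=4, embedding size h=3,
# and a vocabulary of 10 tokens.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(10, 3))          # vocab_size x h
token_ids = np.array([[1, 4, 2, 0], [3, 3, 5, 0]])   # b x s (0 = <PAD>)

embedded = embedding_matrix[token_ids]  # b x s x h: one vector per token
reduced = embedded.sum(axis=1)          # reduce_output: sum -> b x h
# With reduce_output: null the encoder would return `embedded` unchanged.
```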

```
encoder:
    type: embed
    dropout: 0.0
    embedding_size: 256
    representation: dense
    weights_initializer: uniform
    reduce_output: sum
    embeddings_on_cpu: false
    embeddings_trainable: true
    pretrained_embeddings: null
```

Parameters:

- `dropout` (default: `0.0`): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- `embedding_size` (default: `256`): The maximum embedding size. The actual size will be `min(vocabulary_size, embedding_size)` for `dense` representations and exactly `vocabulary_size` for the `sparse` encoding, where `vocabulary_size` is the number of unique strings appearing in the training set input column plus the number of special tokens (`<UNK>`, `<PAD>`, `<SOS>`, `<EOS>`).
- `representation` (default: `dense`): Representation of the embedding. `dense` means the embeddings are initialized randomly, `sparse` means they are initialized to be one-hot encodings. Options: `dense`, `sparse`.
- `weights_initializer` (default: `uniform`): Initializer for the weight matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `reduce_output` (default: `sum`): How to reduce the output tensor along the `s` sequence length dimension if the rank of the tensor is greater than 2. Options: `last`, `sum`, `mean`, `avg`, `max`, `concat`, `attention`, `none`, `None`, `null`.
- `embeddings_on_cpu` (default: `false`): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve it. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: `true`, `false`.
- `embeddings_trainable` (default: `true`): If `true`, embeddings are trained during the training process; if `false`, embeddings are fixed. This may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when `representation` is `dense`; `sparse` one-hot encodings are not trainable. Options: `true`, `false`.
- `pretrained_embeddings` (default: `null`): Path to a file containing pretrained embeddings. By default `dense` embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all the other embeddings plus some random noise to make them different from each other. This parameter has effect only if `representation` is `dense`.
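The `pretrained_embeddings` loading behavior can be approximated as follows. This is a loose sketch, not Ludwig's implementation: the function name is hypothetical, and the noise scale is an arbitrary assumption. GloVe-format lines look like `word 0.1 0.2 ...`.

```python
import numpy as np

def load_glove_embeddings(lines, str2idx, embedding_size):
    """Keep vectors for words in the vocabulary; initialize vocabulary
    words missing from the file with the average vector plus noise."""
    matrix = np.zeros((len(str2idx), embedding_size))
    loaded = np.zeros(len(str2idx), dtype=bool)
    for line in lines:
        parts = line.split()
        word, vec = parts[0], np.array(parts[1:], dtype=float)
        if word in str2idx:          # embeddings outside the vocab are discarded
            matrix[str2idx[word]] = vec
            loaded[str2idx[word]] = True
    avg = matrix[loaded].mean(axis=0)
    rng = np.random.default_rng(0)
    for idx in np.flatnonzero(~loaded):
        # assumed noise scale, just to make unmatched vectors distinct
        matrix[idx] = avg + rng.normal(scale=0.01, size=embedding_size)
    return matrix
```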

#### Parallel CNN Encoder

```
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
C --> D1["1D Conv\n Width 2"] --> E1["Pool"];
C --> D2["1D Conv\n Width 3"] --> E2["Pool"];
C --> D3["1D Conv\n Width 4"] --> E3["Pool"];
C --> D4["1D Conv\n Width 5"] --> E4["Pool"];
E1 --> F["Concat"] --> G["Fully\n Connected\n Layers"] --> H["..."];
E2 --> F;
E3 --> F;
E4 --> F;
```

The parallel cnn encoder is inspired by
Yoon Kim's Convolutional Neural Networks for Sentence Classification.
It works by first mapping the input token sequence `b x s` (where `b` is the batch size and `s` is the length of the sequence) into a sequence of embeddings, then passing the embeddings through a number of parallel 1d convolutional layers with different filter sizes (by default 4 layers with filter sizes 2, 3, 4, and 5), followed by max pooling and concatenation.
The single vector concatenating the outputs of the parallel convolutional layers is then passed through a stack of fully connected layers and returned as a `b x h` tensor, where `h` is the output size of the last fully connected layer.
If you want to output the full `b x s x h` tensor, you can specify `reduce_output: null`.
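The shape bookkeeping of the parallel branches can be traced with a toy NumPy sketch (a single sequence, a naive convolution loop, and arbitrary small sizes; real encoders use optimized, trainable layers):

```python
import numpy as np

def conv1d_valid(x, filters):
    """Toy 1d convolution (valid padding, stride 1). `x` is s x in_ch,
    `filters` is width x in_ch x out_ch."""
    width, _, out_ch = filters.shape
    s_out = x.shape[0] - width + 1
    out = np.empty((s_out, out_ch))
    for i in range(s_out):
        out[i] = np.tensordot(x[i:i + width], filters, axes=([0, 1], [0, 1]))
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 3))   # one sequence: s=7 tokens, embedding size 3
branches = []
for width in (2, 3, 4, 5):    # the default parallel filter sizes
    f = rng.normal(size=(width, 3, 8))               # 8 filters per branch
    branches.append(conv1d_valid(x, f).max(axis=0))  # max pool over time
concat = np.concatenate(branches)  # length 4 * 8, fed to the fc stack
```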

```
encoder:
    type: parallel_cnn
    dropout: 0.0
    embedding_size: 256
    num_conv_layers: null
    output_size: 256
    activation: relu
    filter_size: 3
    norm: null
    representation: dense
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: sum
    conv_layers: null
    pool_function: max
    pool_size: null
    norm_params: null
    num_fc_layers: null
    fc_layers: null
    pretrained_embeddings: null
    num_filters: 256
```

Parameters:

- `dropout` (default: `0.0`): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- `embedding_size` (default: `256`): The maximum embedding size. The actual size will be `min(vocabulary_size, embedding_size)` for `dense` representations and exactly `vocabulary_size` for the `sparse` encoding, where `vocabulary_size` is the number of unique strings appearing in the training set input column plus the number of special tokens (`<UNK>`, `<PAD>`, `<SOS>`, `<EOS>`).
- `num_conv_layers` (default: `null`): The number of stacked convolutional layers when `conv_layers` is `null`.
- `output_size` (default: `256`): The default output_size that will be used for each layer.
- `activation` (default: `relu`): The default activation function that will be used for each layer. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `filter_size` (default: `3`): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- `norm` (default: `null`): The default norm that will be used for each layer. Options: `batch`, `layer`, `null`.
- `representation` (default: `dense`): Representation of the embedding. `dense` means the embeddings are initialized randomly, `sparse` means they are initialized to be one-hot encodings. Options: `dense`, `sparse`.
- `use_bias` (default: `true`): Whether to use a bias vector. Options: `true`, `false`.
- `bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `weights_initializer` (default: `xavier_uniform`): Initializer for the weight matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `embeddings_on_cpu` (default: `false`): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve it. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: `true`, `false`.
- `embeddings_trainable` (default: `true`): If `true`, embeddings are trained during the training process; if `false`, embeddings are fixed. This may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when `representation` is `dense`; `sparse` one-hot encodings are not trainable. Options: `true`, `false`.
- `reduce_output` (default: `sum`): How to reduce the output tensor along the `s` sequence length dimension if the rank of the tensor is greater than 2. Options: `last`, `sum`, `mean`, `avg`, `max`, `concat`, `attention`, `none`, `None`, `null`.
- `conv_layers` (default: `null`): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `num_filters`, `filter_size`, `strides`, `padding`, `dilation_rate`, `use_bias`, `pool_function`, `pool_padding`, `pool_size`, `pool_strides`, `bias_initializer`, `weights_initializer`. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both `conv_layers` and `num_conv_layers` are `null`, a default list will be assigned to `conv_layers` with the value `[{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}]`.
- `pool_function` (default: `max`): Pooling function to use. `max` will select the maximum value. Any of `average`, `avg`, or `mean` will compute the mean value. Options: `last`, `sum`, `mean`, `avg`, `max`, `concat`, `attention`, `none`, `None`, `null`.
- `pool_size` (default: `null`): The default pool_size that will be used for each layer. If a pool_size is not already specified in `conv_layers`, this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the `s` sequence dimension after the convolution operation.
- `norm_params` (default: `null`): Parameters used if norm is either `batch` or `layer`.
- `num_fc_layers` (default: `null`): Number of parallel fully connected layers to use.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters for each fully connected layer.
- `pretrained_embeddings` (default: `null`): Path to a file containing pretrained embeddings. By default `dense` embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all the other embeddings plus some random noise to make them different from each other. This parameter has effect only if `representation` is `dense`.
- `num_filters` (default: `256`): Number of filters, and by consequence number of output channels of the 1d convolution.

#### Stacked CNN Encoder

```
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["1D Conv Layers\n Different Widths"];
C --> D["Fully\n Connected\n Layers"];
D --> ...;
```

The stacked cnn encoder is inspired by Xiang Zhang et al.'s Character-level Convolutional Networks for Text Classification.
It works by first mapping the input token sequence `b x s` (where `b` is the batch size and `s` is the length of the sequence) into a sequence of embeddings, then passing the embeddings through a stack of 1d convolutional layers with different filter sizes (by default 6 layers with filter sizes 7, 7, 3, 3, 3, and 3), followed by an optional final pool and by a flatten operation.
The single flattened vector is then passed through a stack of fully connected layers and returned as a `b x h` tensor, where `h` is the output size of the last fully connected layer.
If you want to output the full `b x s x h` tensor, you can specify the `pool_size` of all your `conv_layers` to be `null` and `reduce_output: null`, while if `pool_size` has a value different from `null` and `reduce_output: null`, the returned tensor will be of shape `b x s' x h`, where `s'` is the width of the output of the last convolutional layer.
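How `s'` shrinks through the default stack can be traced with a short calculation. The output-length formulas below assume the usual `same`/`valid` padding conventions and non-overlapping pooling (stride equal to `pool_size`); the exact conventions inside the encoder may differ:

```python
import math

def conv_len(s, filter_size, padding="same", stride=1):
    # 'same' preserves length (up to stride); 'valid' drops filter_size - 1
    if padding == "same":
        return math.ceil(s / stride)
    return math.ceil((s - filter_size + 1) / stride)

def pool_len(s, pool_size):
    # non-overlapping pooling: stride equal to pool_size, 'valid' padding
    return (s - pool_size) // pool_size + 1

s = 256  # starting sequence length
# the default conv stack: (filter_size, pool_size) per layer
for filter_size, pool_size in [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]:
    s = conv_len(s, filter_size, padding="same")
    if pool_size is not None:
        s = pool_len(s, pool_size)
# s is now the s' of the b x s' x h output tensor
```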

```
encoder:
    type: stacked_cnn
    dropout: 0.0
    num_conv_layers: null
    embedding_size: 256
    output_size: 256
    activation: relu
    filter_size: 3
    strides: 1
    norm: null
    representation: dense
    conv_layers: null
    pool_function: max
    pool_size: null
    dilation_rate: 1
    pool_strides: null
    pool_padding: same
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: sum
    norm_params: null
    num_fc_layers: null
    fc_layers: null
    num_filters: 256
    padding: same
    pretrained_embeddings: null
```

Parameters:

- `dropout` (default: `0.0`): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- `num_conv_layers` (default: `null`): The number of stacked convolutional layers when `conv_layers` is `null`.
- `embedding_size` (default: `256`): The maximum embedding size. The actual size will be `min(vocabulary_size, embedding_size)` for `dense` representations and exactly `vocabulary_size` for the `sparse` encoding, where `vocabulary_size` is the number of unique strings appearing in the training set input column plus the number of special tokens (`<UNK>`, `<PAD>`, `<SOS>`, `<EOS>`).
- `output_size` (default: `256`): The default output_size that will be used for each layer.
- `activation` (default: `relu`): The default activation function that will be used for each layer. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `filter_size` (default: `3`): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- `strides` (default: `1`): Stride length of the convolution.
- `norm` (default: `null`): The default norm that will be used for each layer. Options: `batch`, `layer`, `null`.
- `representation` (default: `dense`): Representation of the embedding. `dense` means the embeddings are initialized randomly, `sparse` means they are initialized to be one-hot encodings. Options: `dense`, `sparse`.
- `conv_layers` (default: `null`): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `num_filters`, `filter_size`, `strides`, `padding`, `dilation_rate`, `use_bias`, `pool_function`, `pool_padding`, `pool_size`, `pool_strides`, `bias_initializer`, `weights_initializer`. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both `conv_layers` and `num_conv_layers` are `null`, a default list will be assigned to `conv_layers` with the value `[{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}]`.
- `pool_function` (default: `max`): Pooling function to use. `max` will select the maximum value. Any of `average`, `avg`, or `mean` will compute the mean value. Options: `last`, `sum`, `mean`, `avg`, `max`, `concat`, `attention`, `none`, `None`, `null`.
- `pool_size` (default: `null`): The default pool_size that will be used for each layer. If a pool_size is not already specified in `conv_layers`, this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the `s` sequence dimension after the convolution operation.
- `dilation_rate` (default: `1`): Dilation rate to use for dilated convolution.
- `pool_strides` (default: `null`): Factor to scale down.
- `pool_padding` (default: `same`): Padding to use. Options: `valid`, `same`.
- `use_bias` (default: `true`): Whether to use a bias vector. Options: `true`, `false`.
- `bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `weights_initializer` (default: `xavier_uniform`): Initializer for the weight matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `embeddings_on_cpu` (default: `false`): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve it. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: `true`, `false`.
- `embeddings_trainable` (default: `true`): If `true`, embeddings are trained during the training process; if `false`, embeddings are fixed. This may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when `representation` is `dense`; `sparse` one-hot encodings are not trainable. Options: `true`, `false`.
- `reduce_output` (default: `sum`): How to reduce the output tensor along the `s` sequence length dimension if the rank of the tensor is greater than 2. Options: `last`, `sum`, `mean`, `avg`, `max`, `concat`, `attention`, `none`, `None`, `null`.
- `norm_params` (default: `null`): Parameters used if norm is either `batch` or `layer`.
- `num_fc_layers` (default: `null`): Number of parallel fully connected layers to use.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters for each fully connected layer.
- `num_filters` (default: `256`): Number of filters, and by consequence number of output channels of the 1d convolution.
- `padding` (default: `same`): Padding to use. Options: `valid`, `same`.
- `pretrained_embeddings` (default: `null`): Path to a file containing pretrained embeddings. By default `dense` embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all the other embeddings plus some random noise to make them different from each other. This parameter has effect only if `representation` is `dense`.

#### Stacked Parallel CNN Encoder

```
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
C --> D1["1D Conv\n Width 2"] --> E["Concat"];
C --> D2["1D Conv\n Width 3"] --> E;
C --> D3["1D Conv\n Width 4"] --> E;
C --> D4["1D Conv\n Width 5"] --> E;
E --> F["..."];
F --> G1["1D Conv\n Width 2"] --> H["Concat"];
F --> G2["1D Conv\n Width 3"] --> H;
F --> G3["1D Conv\n Width 4"] --> H;
F --> G4["1D Conv\n Width 5"] --> H;
H --> I["Pool"] --> J["Fully\n Connected\n Layers"] --> K["..."];
```

The stacked parallel cnn encoder is a combination of the Parallel CNN and the Stacked CNN encoders, where each layer of the stack is composed of parallel convolutional layers.
It works by first mapping the input token sequence `b x s` (where `b` is the batch size and `s` is the length of the sequence) into a sequence of embeddings, then passing the embeddings through a stack of several parallel 1d convolutional layers with different filter sizes, followed by an optional final pool and by a flatten operation.
The single flattened vector is then passed through a stack of fully connected layers and returned as a `b x h` tensor, where `h` is the output size of the last fully connected layer.
If you want to output the full `b x s x h` tensor, you can specify `reduce_output: null`.

```
encoder:
    type: stacked_parallel_cnn
    dropout: 0.0
    embedding_size: 256
    output_size: 256
    activation: relu
    filter_size: 3
    norm: null
    representation: dense
    num_stacked_layers: null
    pool_function: max
    pool_size: null
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: sum
    norm_params: null
    num_fc_layers: null
    fc_layers: null
    stacked_layers: null
    num_filters: 256
    pretrained_embeddings: null
```

Parameters:

- `dropout` (default: `0.0`): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- `embedding_size` (default: `256`): The maximum embedding size. The actual size will be `min(vocabulary_size, embedding_size)` for `dense` representations and exactly `vocabulary_size` for the `sparse` encoding, where `vocabulary_size` is the number of unique strings appearing in the training set input column plus the number of special tokens (`<UNK>`, `<PAD>`, `<SOS>`, `<EOS>`).
- `output_size` (default: `256`): The default output_size that will be used for each layer.
- `activation` (default: `relu`): The default activation function that will be used for each layer. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `filter_size` (default: `3`): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- `norm` (default: `null`): The default norm that will be used for each layer. Options: `batch`, `layer`, `null`.
- `representation` (default: `dense`): Representation of the embedding. `dense` means the embeddings are initialized randomly, `sparse` means they are initialized to be one-hot encodings. Options: `dense`, `sparse`.
- `num_stacked_layers` (default: `null`): If `stacked_layers` is `null`, this is the number of elements in the stack of parallel convolutional layers.
- `pool_function` (default: `max`): Pooling function to use. `max` will select the maximum value. Any of `average`, `avg`, or `mean` will compute the mean value. Options: `last`, `sum`, `mean`, `avg`, `max`, `concat`, `attention`, `none`, `None`, `null`.
- `pool_size` (default: `null`): The default pool_size that will be used for each layer. If a pool_size is not already specified in `conv_layers`, this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the `s` sequence dimension after the convolution operation.
- `use_bias` (default: `true`): Whether to use a bias vector. Options: `true`, `false`.
- `bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `weights_initializer` (default: `xavier_uniform`): Initializer for the weight matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `embeddings_on_cpu` (default: `false`): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve it. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: `true`, `false`.
- `embeddings_trainable` (default: `true`): If `true`, embeddings are trained during the training process; if `false`, embeddings are fixed. This may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when `representation` is `dense`; `sparse` one-hot encodings are not trainable. Options: `true`, `false`.
- `reduce_output` (default: `sum`): How to reduce the output tensor along the `s` sequence length dimension if the rank of the tensor is greater than 2. Options: `last`, `sum`, `mean`, `avg`, `max`, `concat`, `attention`, `none`, `None`, `null`.
- `norm_params` (default: `null`): Parameters used if norm is either `batch` or `layer`.
- `num_fc_layers` (default: `null`): Number of parallel fully connected layers to use.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters for each fully connected layer.
- `stacked_layers` (default: `null`): A nested list of lists of dictionaries containing the parameters of the stack of parallel convolutional layers. The length of the list determines the number of stacked parallel convolutional layers, the length of the sub-lists determines the number of parallel conv layers, and the content of each dictionary determines the parameters for a specific layer.
- `num_filters` (default: `256`): Number of filters, and by consequence number of output channels of the 1d convolution.
(default:`pretrained_embeddings`

`null`

): Path to a file containing pretrained embeddings. By default`dense`

embeddings are initialized randomly, but this parameter allows to specify a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embedding plus some random noise to make them different from each other. This parameter has effect only if`representation`

is`dense`

.
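As a sketch of the nested `stacked_layers` structure described above (the filter sizes here are made-up illustrative values): the outer list gives the stack depth, and each inner list gives the parallel convolutional branches of that stacked layer.

```
encoder:
  type: stacked_parallel_cnn
  stacked_layers:
    - - filter_size: 2   # first stacked layer: three parallel conv branches
      - filter_size: 3
      - filter_size: 5
    - - filter_size: 2   # second stacked layer: two parallel conv branches
      - filter_size: 4
```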

#### RNN Encoder¶

```
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["RNN Layers"];
C --> D["Fully\n Connected\n Layers"];
D --> ...;
```

The `rnn` encoder works by first mapping the input token sequence `b x s` (where `b` is the batch size and `s` is the length of the sequence) into a sequence of embeddings. It then passes the embeddings through a stack of recurrent layers (by default 1 layer), followed by a reduce operation that by default only returns the last output, but can perform other reduce functions.
If you want to output the full `b x s x h` tensor, where `h` is the size of the output of the last rnn layer, specify `reduce_output: null`.

```
encoder:
  type: rnn
  dropout: 0.0
  cell_type: rnn
  num_layers: 1
  state_size: 256
  embedding_size: 256
  output_size: 256
  norm: null
  num_fc_layers: 0
  fc_dropout: 0.0
  recurrent_dropout: 0.0
  activation: tanh
  fc_activation: relu
  recurrent_activation: sigmoid
  representation: dense
  unit_forget_bias: true
  recurrent_initializer: orthogonal
  use_bias: true
  bias_initializer: zeros
  weights_initializer: xavier_uniform
  embeddings_on_cpu: false
  embeddings_trainable: true
  reduce_output: last
  norm_params: null
  fc_layers: null
  bidirectional: false
  pretrained_embeddings: null
```
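The `reduce_output` behavior used by this encoder can be sketched in plain Python. This is only an illustration of the tensor shapes involved (using nested lists instead of tensors), not Ludwig's actual implementation:

```python
# Illustrate what common reduce_output modes do to an encoder output
# of shape b x s x h, represented here as nested Python lists.

def reduce_sequence(batch, mode):
    """Reduce a b x s x h nested list along the sequence dimension s."""
    if mode is None:                  # reduce_output: null -> keep full b x s x h
        return batch
    reduced = []
    for seq in batch:                 # each seq has shape s x h
        if mode == "last":            # keep only the final timestep
            reduced.append(seq[-1])
        elif mode == "sum":           # elementwise sum over s
            reduced.append([sum(col) for col in zip(*seq)])
        elif mode == "mean":          # elementwise mean over s
            reduced.append([sum(col) / len(seq) for col in zip(*seq)])
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return reduced                    # shape b x h

batch = [[[1.0, 2.0], [3.0, 4.0]]]    # b=1, s=2, h=2
print(reduce_sequence(batch, "last"))  # [[3.0, 4.0]]
print(reduce_sequence(batch, "sum"))   # [[4.0, 6.0]]
```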

Parameters:

- `dropout` (default: `0.0`): Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `cell_type` (default: `rnn`): The type of recurrent cell to use. For reference about the differences between the cells please refer to torch.nn Recurrent Layers. Options: `rnn`, `lstm`, `gru`.
- `num_layers` (default: `1`): The number of stacked recurrent layers.
- `state_size` (default: `256`): The size of the state of the rnn.
- `embedding_size` (default: `256`): The maximum embedding size. The actual size will be `min(vocabulary_size, embedding_size)` for `dense` representations and exactly `vocabulary_size` for the `sparse` encoding, where `vocabulary_size` is the number of unique strings appearing in the training set input column plus the number of special tokens (`<UNK>`, `<PAD>`, `<SOS>`, `<EOS>`).
- `output_size` (default: `256`): The default output_size that will be used for each layer.
- `norm` (default: `null`): The default norm that will be used for each layer. Options: `batch`, `layer`, `ghost`, `null`.
- `num_fc_layers` (default: `0`): Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- `fc_dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `recurrent_dropout` (default: `0.0`): The dropout rate for the recurrent state.
- `activation` (default: `tanh`): The default activation function. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `fc_activation` (default: `relu`): Default activation function applied to the output of the fully connected layers. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `recurrent_activation` (default: `sigmoid`): The activation function to use in the recurrent step. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `representation` (default: `dense`): Representation of the embedding. `dense` means the embeddings are initialized randomly, `sparse` means they are initialized to be one-hot encodings. Options: `dense`, `sparse`.
- `unit_forget_bias` (default: `true`): If true, add 1 to the bias of the forget gate at initialization. Options: `true`, `false`.
- `recurrent_initializer` (default: `orthogonal`): The initializer for recurrent matrix weights. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`.
- `use_bias` (default: `true`): Whether to use a bias vector. Options: `true`, `false`.
- `bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `weights_initializer` (default: `xavier_uniform`): Initializer for the weight matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `embeddings_on_cpu` (default: `false`): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve embedding lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: `true`, `false`.
- `embeddings_trainable` (default: `true`): If `true` embeddings are trained during the training process, if `false` embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when `representation` is `dense`; `sparse` one-hot encodings are not trainable. Options: `true`, `false`.
- `reduce_output` (default: `last`): How to reduce the output tensor along the `s` sequence length dimension if the rank of the tensor is greater than 2. Options: `last`, `sum`, `mean`, `avg`, `max`, `concat`, `attention`, `none`, `None`, `null`.
- `norm_params` (default: `null`): Default parameters passed to the `norm` module.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- `bidirectional` (default: `false`): If true, two recurrent networks will perform encoding in the forward and backward direction and their outputs will be concatenated. Options: `true`, `false`.
- `pretrained_embeddings` (default: `null`): Path to a file containing pretrained embeddings. By default `dense` embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if `representation` is `dense`.

#### CNN RNN Encoder¶

```
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C1["CNN Layers"];
C1 --> C2["RNN Layers"];
C2 --> D["Fully\n Connected\n Layers"];
D --> ...;
```

The `cnnrnn` encoder works by first mapping the input token sequence `b x s` (where `b` is the batch size and `s` is the length of the sequence) into a sequence of embeddings. It then passes the embeddings through a stack of convolutional layers (by default 2), followed by a stack of recurrent layers (by default 1), followed by a reduce operation that by default only returns the last output, but can perform other reduce functions.
If you want to output the full `b x s x h` tensor, where `h` is the size of the output of the last rnn layer, specify `reduce_output: null`.

```
encoder:
  type: cnnrnn
  dropout: 0.0
  conv_dropout: 0.0
  cell_type: rnn
  num_conv_layers: null
  state_size: 256
  embedding_size: 256
  output_size: 256
  norm: null
  num_fc_layers: 0
  fc_dropout: 0.0
  recurrent_dropout: 0.0
  activation: tanh
  filter_size: 5
  strides: 1
  fc_activation: relu
  recurrent_activation: sigmoid
  conv_activation: relu
  representation: dense
  conv_layers: null
  pool_function: max
  pool_size: null
  dilation_rate: 1
  pool_strides: null
  pool_padding: same
  unit_forget_bias: true
  recurrent_initializer: orthogonal
  use_bias: true
  bias_initializer: zeros
  weights_initializer: xavier_uniform
  embeddings_on_cpu: false
  embeddings_trainable: true
  reduce_output: last
  norm_params: null
  fc_layers: null
  num_filters: 256
  padding: same
  num_rec_layers: 1
  bidirectional: false
  pretrained_embeddings: null
```

Parameters:

- `dropout` (default: `0.0`): Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `conv_dropout` (default: `0.0`): The dropout rate for the convolutional layers.
- `cell_type` (default: `rnn`): The type of recurrent cell to use. For reference about the differences between the cells please refer to torch.nn Recurrent Layers. Options: `rnn`, `lstm`, `gru`.
- `num_conv_layers` (default: `null`): The number of stacked convolutional layers when `conv_layers` is `null`.
- `state_size` (default: `256`): The size of the state of the rnn.
- `embedding_size` (default: `256`): The maximum embedding size. The actual size will be `min(vocabulary_size, embedding_size)` for `dense` representations and exactly `vocabulary_size` for the `sparse` encoding, where `vocabulary_size` is the number of unique strings appearing in the training set input column plus the number of special tokens (`<UNK>`, `<PAD>`, `<SOS>`, `<EOS>`).
- `output_size` (default: `256`): The default output_size that will be used for each layer.
- `norm` (default: `null`): The default norm that will be used for each layer. Options: `batch`, `layer`, `ghost`, `null`.
- `num_fc_layers` (default: `0`): Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- `fc_dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `recurrent_dropout` (default: `0.0`): The dropout rate for the recurrent state.
- `activation` (default: `tanh`): The default activation function to use. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `filter_size` (default: `5`): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- `strides` (default: `1`): Stride length of the convolution.
- `fc_activation` (default: `relu`): Default activation function applied to the output of the fully connected layers. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `recurrent_activation` (default: `sigmoid`): The activation function to use in the recurrent step. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `conv_activation` (default: `relu`): The default activation function that will be used for each convolutional layer. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `representation` (default: `dense`): Representation of the embedding. `dense` means the embeddings are initialized randomly, `sparse` means they are initialized to be one-hot encodings. Options: `dense`, `sparse`.
- `conv_layers` (default: `null`): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `num_filters`, `filter_size`, `strides`, `padding`, `dilation_rate`, `use_bias`, `pool_function`, `pool_padding`, `pool_size`, `pool_strides`, `bias_initializer`, `weights_initializer`. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both `conv_layers` and `num_conv_layers` are `null`, a default list will be assigned to `conv_layers` with the value `[{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}]`.
- `pool_function` (default: `max`): Pooling function to use. `max` will select the maximum value. Any of `average`, `avg`, or `mean` will compute the mean value. Options: `last`, `sum`, `mean`, `avg`, `max`, `concat`, `attention`, `none`, `None`, `null`.
- `pool_size` (default: `null`): The default pool_size that will be used for each layer when one is not already specified in `conv_layers`. It indicates the size of the max pooling performed along the `s` sequence dimension after the convolution operation.
- `dilation_rate` (default: `1`): Dilation rate to use for dilated convolution.
- `pool_strides` (default: `null`): Factor to scale down.
- `pool_padding` (default: `same`): Padding to use. Options: `valid`, `same`.
- `unit_forget_bias` (default: `true`): If true, add 1 to the bias of the forget gate at initialization. Options: `true`, `false`.
- `recurrent_initializer` (default: `orthogonal`): The initializer for recurrent matrix weights. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`.
- `use_bias` (default: `true`): Whether to use a bias vector. Options: `true`, `false`.
- `bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `weights_initializer` (default: `xavier_uniform`): Initializer for the weight matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `embeddings_on_cpu` (default: `false`): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve embedding lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: `true`, `false`.
- `embeddings_trainable` (default: `true`): If `true` embeddings are trained during the training process, if `false` embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when `representation` is `dense`; `sparse` one-hot encodings are not trainable. Options: `true`, `false`.
- `reduce_output` (default: `last`): How to reduce the output tensor along the `s` sequence length dimension if the rank of the tensor is greater than 2. Options: `last`, `sum`, `mean`, `avg`, `max`, `concat`, `attention`, `none`, `None`, `null`.
- `norm_params` (default: `null`): Default parameters passed to the `norm` module.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- `num_filters` (default: `256`): Number of filters, and by consequence number of output channels of the 1d convolution.
- `padding` (default: `same`): Padding to use. Options: `valid`, `same`.
- `num_rec_layers` (default: `1`): The number of stacked recurrent layers.
- `bidirectional` (default: `false`): If true, two recurrent networks will perform encoding in the forward and backward direction and their outputs will be concatenated. Options: `true`, `false`.
- `pretrained_embeddings` (default: `null`): Path to a file containing pretrained embeddings. By default `dense` embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if `representation` is `dense`.
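To get a feel for how the pooling in the default conv stack shortens the sequence dimension, here is a rough plain-Python illustration. It assumes, as a simplification, that the pool stride equals the pool size and that layers with `pool_size: null` leave the length unchanged; the exact lengths in a real model may differ depending on padding:

```python
# Rough sequence-length tracking through a stack of conv layers with
# 'same' padding and stride 1, each optionally followed by max pooling.
import math

def length_after_stack(seq_len, pool_sizes):
    """pool_sizes: one entry per conv layer; None means no pooling."""
    for p in pool_sizes:
        if p is not None:
            # assumption: pool stride equals pool size
            seq_len = math.floor(seq_len / p)
    return seq_len

# Pool sizes taken from the default conv_layers list shown above.
print(length_after_stack(128, [3, 3, None, None, None, 3]))  # 4
```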

#### Transformer Encoder¶

```
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["Transformer\n Blocks"];
C --> D["Fully\n Connected\n Layers"];
D --> ...;
```

The `transformer` encoder implements a stack of transformer blocks, replicating the architecture introduced in the
Attention Is All You Need paper, and adds an optional stack of fully connected
layers at the end.

```
encoder:
  type: transformer
  dropout: 0.1
  num_layers: 1
  embedding_size: 256
  output_size: 256
  norm: null
  num_fc_layers: 0
  fc_dropout: 0.0
  hidden_size: 256
  transformer_output_size: 256
  fc_activation: relu
  representation: dense
  use_bias: true
  bias_initializer: zeros
  weights_initializer: xavier_uniform
  embeddings_on_cpu: false
  embeddings_trainable: true
  reduce_output: last
  norm_params: null
  fc_layers: null
  num_heads: 8
  pretrained_embeddings: null
```

Parameters:

- `dropout` (default: `0.1`): The dropout rate for the transformer block. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `num_layers` (default: `1`): The number of transformer layers.
- `embedding_size` (default: `256`): The maximum embedding size. The actual size will be `min(vocabulary_size, embedding_size)` for `dense` representations and exactly `vocabulary_size` for the `sparse` encoding, where `vocabulary_size` is the number of unique strings appearing in the training set input column plus the number of special tokens (`<UNK>`, `<PAD>`, `<SOS>`, `<EOS>`).
- `output_size` (default: `256`): The default output_size that will be used for each layer.
- `norm` (default: `null`): The default norm that will be used for each layer. Options: `batch`, `layer`, `ghost`, `null`.
- `num_fc_layers` (default: `0`): Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- `fc_dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `hidden_size` (default: `256`): The size of the hidden representation within the transformer block. It is usually the same as `embedding_size`, but if the two values are different, a projection layer will be added before the first transformer block.
- `transformer_output_size` (default: `256`): Size of the fully connected layer after self attention in the transformer block. This is usually the same as `hidden_size` and `embedding_size`.
- `fc_activation` (default: `relu`): Default activation function applied to the output of the fully connected layers. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `representation` (default: `dense`): Representation of the embedding. `dense` means the embeddings are initialized randomly, `sparse` means they are initialized to be one-hot encodings. Options: `dense`, `sparse`.
- `use_bias` (default: `true`): Whether to use a bias vector. Options: `true`, `false`.
- `bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `weights_initializer` (default: `xavier_uniform`): Initializer for the weight matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `embeddings_on_cpu` (default: `false`): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve embedding lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: `true`, `false`.
- `embeddings_trainable` (default: `true`): If `true` embeddings are trained during the training process, if `false` embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when `representation` is `dense`; `sparse` one-hot encodings are not trainable. Options: `true`, `false`.
- `reduce_output` (default: `last`): How to reduce the output tensor along the `s` sequence length dimension if the rank of the tensor is greater than 2. Options: `last`, `sum`, `mean`, `avg`, `max`, `concat`, `attention`, `none`, `None`, `null`.
- `norm_params` (default: `null`): Default parameters passed to the `norm` module.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- `num_heads` (default: `8`): Number of attention heads in each transformer block.
- `pretrained_embeddings` (default: `null`): Path to a file containing pretrained embeddings. By default `dense` embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if `representation` is `dense`.
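A general property of multi-head attention (standard transformer behavior, not something specific to this encoder) worth keeping in mind when tuning `num_heads` and `hidden_size`: each head attends over `hidden_size / num_heads` dimensions, so `hidden_size` is normally chosen as a multiple of `num_heads`.

```python
# Per-head dimensionality in multi-head attention.
def head_dim(hidden_size, num_heads):
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size should be divisible by num_heads")
    return hidden_size // num_heads

print(head_dim(256, 8))  # 32, with the defaults shown above
```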

## Output Features¶

Sequence output features can be used for either tagging (classifying each element of an input sequence) or
generation (generating a sequence by sampling from the model). Ludwig provides two sequence decoders, named `tagger` and `generator` respectively.

Example sequence output feature using default parameters:

```
name: seq_column_name
type: sequence
reduce_input: null
dependencies: []
reduce_dependencies: sum
loss:
  type: softmax_cross_entropy
  confidence_penalty: 0
  robust_lambda: 0
  class_weights: 1
  class_similarities_temperature: 0
decoder:
  type: generator
```

Parameters:

- `reduce_input` (default `sum`): defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: `sum`, `mean` or `avg`, `max`, `concat` (concatenates along the sequence dimension), `last` (returns the last vector of the sequence dimension).
- `dependencies` (default `[]`): the output features this one is dependent on. For a detailed explanation refer to Output Feature Dependencies.
- `reduce_dependencies` (default `sum`): defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: `sum`, `mean` or `avg`, `max`, `concat` (concatenates along the sequence dimension), `last` (returns the last vector of the sequence dimension).
- `loss` (default `{type: softmax_cross_entropy, class_similarities_temperature: 0, class_weights: 1, confidence_penalty: 0, robust_lambda: 0}`): a dictionary containing a loss `type`. The only available loss `type` for sequences is `softmax_cross_entropy`. See Loss for details.
- `decoder` (default: `{"type": "generator"}`): Decoder for the desired task. Options: `generator`, `tagger`. See Decoder for details.

Decoder type and decoder parameters can also be defined once and applied to all sequence output features using the Type-Global Decoder section. Loss and loss related parameters can also be defined once in the same way.
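For instance, a tagger output feature that depends on another output feature could be configured as follows. The column names here are made up for illustration; `class_column_name` stands in for a hypothetical sibling output feature:

```
name: seq_column_name
type: sequence
decoder:
  type: tagger
dependencies:
  - class_column_name
reduce_dependencies: sum
```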

### Decoders¶

#### Generator¶

```
graph LR
A["Combiner Output"] --> B["Fully\n Connected\n Layers"];
B --> C1["RNN"] --> C2["RNN"] --> C3["RNN"];
GO(["GO"]) -.-o C1;
C1 -.-o O1("Output");
O1 -.-o C2;
C2 -.-o O2("Output");
O2 -.-o C3;
C3 -.-o END(["END"]);
subgraph DEC["DECODER.."]
B
C1
C2
C3
end
```

In the case of `generator` the decoder is a (potentially empty) stack of fully connected layers, followed by an RNN that generates outputs by feeding on its own previous predictions, producing a tensor of size `b x s' x c`, where `b` is the batch size, `s'` is the length of the generated sequence and `c` is the number of classes, followed by a softmax_cross_entropy.
During training, teacher forcing is adopted, meaning the list of targets is provided as both inputs and outputs (shifted by 1), while at evaluation time decoding is performed by beam search, with a beam of 1 (greedy decoding) by default.
In general a generator expects a `b x h` shaped input tensor, where `h` is a hidden dimension. The `h` vectors are (after an optional stack of fully connected layers) fed into the RNN generator.
One exception is when the generator uses attention: in that case the expected size of the input tensor is `b x s x h`, which is the output of a sequence, text or time series input feature without reduced outputs, or the output of a sequence-based combiner.
If a `b x h` input is provided to a generator decoder using an RNN with attention instead, an error will be raised during model building.
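The evaluation-time behavior described above (beam search with a beam of 1, i.e. greedy decoding, feeding each prediction back as the next input) can be sketched as follows. This is a toy illustration, not Ludwig's implementation: `step` stands in for one RNN step plus projection, and in a real decoder it would also carry a hidden state.

```python
import numpy as np

def greedy_decode(step, start_token, end_token, max_len):
    """Generate tokens one at a time, feeding each prediction back in.

    `step` maps (previous token id) -> logits over the vocabulary.
    """
    tokens = []
    prev = start_token
    for _ in range(max_len):
        logits = step(prev)
        prev = int(np.argmax(logits))  # beam of 1 == pick the single best token
        if prev == end_token:
            break
        tokens.append(prev)
    return tokens

# Toy step function: token t deterministically produces t + 1.
vocab_size = 5
toy_step = lambda t: np.eye(vocab_size)[(t + 1) % vocab_size]
print(greedy_decode(toy_step, start_token=0, end_token=4, max_len=10))
# generates 1, 2, 3, then stops at the end token 4
```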

```
decoder:
type: generator
num_fc_layers: 0
fc_output_size: 256
fc_norm: null
fc_dropout: 0.0
cell_type: gru
num_layers: 1
fc_activation: relu
reduce_input: sum
fc_layers: null
fc_use_bias: true
fc_weights_initializer: xavier_uniform
fc_bias_initializer: zeros
fc_norm_params: null
```

Parameters:

- `num_fc_layers` (default: `0`): Number of fully connected layers if `fc_layers` is not specified. Increasing the number of layers adds capacity to the model, enabling it to learn more complex feature interactions.
- `fc_output_size` (default: `256`): Output size of the fully connected stack.
- `fc_norm` (default: `null`): Default normalization applied at the beginning of fully connected layers. Options: `batch`, `layer`, `ghost`, `null`.
- `fc_dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- `cell_type` (default: `gru`): Type of recurrent cell to use. Options: `rnn`, `lstm`, `gru`.
- `num_layers` (default: `1`): The number of stacked recurrent layers.
- `fc_activation` (default: `relu`): Default activation function applied to the output of the fully connected layers. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `reduce_input` (default: `sum`): How to reduce an input that is not a vector, but a matrix or a higher-order tensor, on the first dimension (second if you count the batch dimension). Options: `sum`, `mean`, `avg`, `max`, `concat`, `last`.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default provided as a standalone parameter will be used instead.
- `fc_use_bias` (default: `true`): Whether the layers in the fc_stack use a bias vector. Options: `true`, `false`.
- `fc_weights_initializer` (default: `xavier_uniform`): The weights initializer to use for the layers in the fc_stack.
- `fc_bias_initializer` (default: `zeros`): The bias initializer to use for the layers in the fc_stack.
- `fc_norm_params` (default: `null`): Default parameters passed to the `norm` module.
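Putting a few of these parameters together, a sequence output feature overriding the generator defaults might look like the following sketch (the feature name is hypothetical):

```yaml
output_features:
  - name: answer
    type: sequence
    decoder:
      type: generator
      cell_type: lstm
      num_layers: 2
      num_fc_layers: 1
      fc_output_size: 512
```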

#### Tagger¶

```
graph LR
A["emb[0]\n....\nemb[n]"] --> B["Fully\n Connected\n Layers"];
B --> C["Projection\n....\nProjection"];
C --> D["Softmax\n....\nSoftmax"];
subgraph DEC["DECODER.."]
B
C
D
end
subgraph COM["COMBINER OUT.."]
A
end
```

In the case of `tagger` the decoder is a (potentially empty) stack of fully connected layers, followed by a projection into a tensor of size `b x s x c`, where `b` is the batch size, `s` is the length of the sequence and `c` is the number of classes, followed by a softmax_cross_entropy.
This decoder requires its input to be shaped as `b x s x h`, where `h` is a hidden dimension, which is the output of a sequence, text or time series input feature without reduced outputs, or the output of a sequence-based combiner.
If a `b x h` input is provided instead, an error will be raised during model building.
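The shape requirement can be illustrated with a small check (a sketch, not Ludwig code): the per-token projection needs one hidden vector per sequence position, so a reduced `b x h` tensor has no sequence axis to tag over.

```python
import numpy as np

def tagger_project(hidden, proj):
    """Project per-token hidden states (b x s x h) to class logits (b x s x c)."""
    if hidden.ndim != 3:
        raise ValueError(
            f"tagger needs a b x s x h input, got shape {hidden.shape}; "
            "use an unreduced sequence encoder or a sequence-based combiner"
        )
    return hidden @ proj  # matmul broadcasts over the batch and sequence dims

b, s, h, c = 2, 4, 8, 3
logits = tagger_project(np.zeros((b, s, h)), np.zeros((h, c)))
print(logits.shape)  # (2, 4, 3): one class distribution per token
```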

```
decoder:
type: tagger
num_fc_layers: 0
fc_output_size: 256
fc_norm: null
fc_dropout: 0.0
fc_activation: relu
attention_embedding_size: 256
fc_layers: null
fc_use_bias: true
fc_weights_initializer: xavier_uniform
fc_bias_initializer: zeros
fc_norm_params: null
use_attention: false
use_bias: true
attention_num_heads: 8
```

Parameters:

- `num_fc_layers` (default: `0`): Number of fully connected layers if `fc_layers` is not specified. Increasing the number of layers adds capacity to the model, enabling it to learn more complex feature interactions.
- `fc_output_size` (default: `256`): Output size of the fully connected stack.
- `fc_norm` (default: `null`): Default normalization applied at the beginning of fully connected layers. Options: `batch`, `layer`, `ghost`, `null`.
- `fc_dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- `fc_activation` (default: `relu`): Default activation function applied to the output of the fully connected layers. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `attention_embedding_size` (default: `256`): The embedding size of the multi-head self-attention layer.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default provided as a standalone parameter will be used instead.
- `fc_use_bias` (default: `true`): Whether the layers in the fc_stack use a bias vector. Options: `true`, `false`.
- `fc_weights_initializer` (default: `xavier_uniform`): The weights initializer to use for the layers in the fc_stack.
- `fc_bias_initializer` (default: `zeros`): The bias initializer to use for the layers in the fc_stack.
- `fc_norm_params` (default: `null`): Default parameters passed to the `norm` module.
- `use_attention` (default: `false`): Whether to apply a multi-head self-attention layer before prediction. Options: `true`, `false`.
- `use_bias` (default: `true`): Whether the layer uses a bias vector. Options: `true`, `false`.
- `attention_num_heads` (default: `8`): The number of attention heads in the multi-head self-attention layer.

### Loss¶

#### Sequence Softmax Cross Entropy¶

```
loss:
type: sequence_softmax_cross_entropy
class_weights: null
weight: 1.0
robust_lambda: 0
confidence_penalty: 0
class_similarities: null
class_similarities_temperature: 0
unique: false
```

Parameters:

- `class_weights` (default: `null`): Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied by the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of an unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the `<UNK>` class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like `{class_a: 0.5, class_b: 0.7, ...}`.
- `weight` (default: `1.0`): Weight of the loss.
- `robust_lambda` (default: `0`): Replaces the loss with `(1 - robust_lambda) * loss + robust_lambda / c` where `c` is the number of classes. Useful in case of noisy labels.
- `confidence_penalty` (default: `0`): Penalizes overconfident (low entropy) predictions by adding an `a * (max_entropy - entropy) / max_entropy` term to the loss, where `a` is the value of this parameter. Useful in case of noisy labels.
- `class_similarities` (default: `null`): If not `null`, a `c x c` matrix in the form of a list of lists that contains the mutual similarity of classes. It is used if `class_similarities_temperature` is greater than 0. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the `<UNK>` class needs to be included too).
- `class_similarities_temperature` (default: `0`): The temperature parameter of the softmax that is performed on each row of `class_similarities`. The output of that softmax is used to determine the supervision vector to provide instead of the one-hot vector that would otherwise be provided for each datapoint. The intuition behind it is that errors between similar classes are more tolerable than errors between really different classes.
- `unique` (default: `false`): If true, the loss is only computed for unique elements in the sequence. Options: `true`, `false`.
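The effect of `robust_lambda` and `confidence_penalty` on a single token's loss can be sketched numerically. This is a toy illustration of the formulas above, not Ludwig's implementation:

```python
import numpy as np

def token_loss(logits, target, robust_lambda=0.0, confidence_penalty=0.0):
    """Softmax cross entropy with the two noisy-label adjustments."""
    c = len(logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    ce = -np.log(probs[target])
    # robust_lambda: (1 - robust_lambda) * loss + robust_lambda / c
    loss = (1 - robust_lambda) * ce + robust_lambda / c
    # confidence_penalty: a * (max_entropy - entropy) / max_entropy
    entropy = -np.sum(probs * np.log(probs))
    max_entropy = np.log(c)
    loss += confidence_penalty * (max_entropy - entropy) / max_entropy
    return loss

logits = np.array([4.0, 1.0, 0.0])  # a very confident prediction
plain = token_loss(logits, target=0)
penalized = token_loss(logits, target=0, confidence_penalty=0.5)
print(plain < penalized)  # the confident prediction pays the extra penalty
```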

### Metrics¶

The metrics that are calculated every epoch and are available for sequence features are:

- `sequence_accuracy`: The rate at which the model predicted the correct sequence.
- `token_accuracy`: The number of tokens correctly predicted divided by the total number of tokens in all sequences.
- `last_accuracy`: Accuracy considering only the last element of the sequence. Useful to ensure special end-of-sequence tokens are generated or tagged.
- `edit_distance`: Levenshtein distance: the minimum number of single-token edits (insertions, deletions or substitutions) required to change the predicted sequence into the ground truth.
- `perplexity`: The inverse of the predicted probability of the ground truth sequence, normalized by the number of tokens. The lower the perplexity, the higher the probability of predicting the true sequence.
- `loss`: The value of the loss function.
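Two of these metrics are simple enough to sketch directly (toy versions for intuition, not Ludwig's implementations):

```python
def token_accuracy(preds, targets):
    """Fraction of positions where the predicted token matches the target."""
    correct = sum(p == t for p, t in zip(preds, targets))
    return correct / len(targets)

def edit_distance(a, b):
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution (free if equal)
    return dp[-1]

pred = ["the", "cat", "sat"]
truth = ["the", "cat", "sits"]
print(token_accuracy(pred, truth))  # 2 of 3 tokens match
print(edit_distance(pred, truth))   # one substitution -> 1
```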

You can set any of the above as `validation_metric` in the `training` section of the configuration if `validation_field` names a sequence feature.