Sequence Features
Preprocessing¶
Sequence features are transformed into an integer-valued matrix of size n x l (where n is the number of rows and l is the minimum of the length of the longest sequence and a max_sequence_length parameter) and added to HDF5 with a key that reflects the name of the column in the dataset.
Each sequence is mapped to a list of integers internally. First, a tokenizer converts each sequence to a list of tokens (by default, tokenization splits on spaces). Next, a dictionary is constructed which maps each unique token to its frequency in the dataset column. Tokens are ranked by frequency and a sequential integer ID is assigned, from the most frequent to the most rare. Ludwig uses the <PAD>, <UNK>, <SOS>, and <EOS> special symbols for padding, unknown, start, and end, consistent with common NLP deep learning practice. Special symbols can also be set manually in the preprocessing config.
The column name is added to the JSON file, with an associated dictionary containing:
- the mapping from integer to string (idx2str)
- the mapping from string to id (str2idx)
- the mapping from string to frequency (str2freq)
- the maximum length of all sequences (max_sequence_length)
- additional preprocessing information (how to fill missing values and what token to use to fill missing values)
preprocessing:
tokenizer: space
sequence_length: null
max_sequence_length: 256
missing_value_strategy: fill_with_const
most_common: 20000
lowercase: false
fill_value: <UNK>
ngram_size: 2
padding_symbol: <PAD>
unknown_symbol: <UNK>
padding: right
cache_encoder_embeddings: false
vocab_file: null
Parameters:
- tokenizer (default: space): Defines how to map from the raw string content of the dataset column to a sequence of elements.
- sequence_length (default: null): The desired length (number of tokens) of the sequence. Sequences longer than this value will be truncated and sequences shorter than this value will be padded. If None, sequence length will be inferred from the training dataset.
- max_sequence_length (default: 256): The maximum length (number of tokens) of the sequence. Sequences longer than this value will be truncated. Useful as a stopgap measure if sequence_length is set to None. If None, max sequence length will be inferred from the training dataset.
- missing_value_strategy (default: fill_with_const): What strategy to follow when there's a missing value in a text column. Options: fill_with_const, fill_with_mode, bfill, ffill, drop_row. See Missing Value Strategy for details.
- most_common (default: 20000): The maximum number of most common tokens in the vocabulary. If the data contains more than this amount, the most infrequent symbols will be treated as unknown.
- lowercase (default: false): If true, converts the string to lowercase before tokenizing. Options: true, false.
- fill_value (default: <UNK>): The value to replace missing values with when missing_value_strategy is fill_with_const.
- ngram_size (default: 2): The size of the ngram when using the ngram tokenizer (e.g., 2 = bigram, 3 = trigram, etc.).
- padding_symbol (default: <PAD>): The string used as a padding symbol. This special token is mapped to the integer ID 0 in the vocabulary.
- unknown_symbol (default: <UNK>): The string used as an unknown placeholder. This special token is mapped to the integer ID 1 in the vocabulary.
- padding (default: right): The direction of the padding. Options: left, right.
- cache_encoder_embeddings (default: false): Compute encoder embeddings in preprocessing, speeding up training time considerably. Options: true, false.
- vocab_file (default: null): Filepath string to a UTF-8 encoded file containing the sequence's vocabulary. On each line, the first string up to the first tab or newline is considered a word.
Preprocessing parameters can also be defined once and applied to all sequence input features using the Type-Global Preprocessing section.
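For example, a defaults entry along the following lines (the specific values here are illustrative) would apply the same tokenizer and length limit to every sequence feature in the config:

defaults:
  sequence:
    preprocessing:
      tokenizer: space
      max_sequence_length: 128
      lowercase: true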
Input Features¶
Sequence features have several encoders and each of them has its own parameters.
Inputs are of size b
while outputs are of size b x h
where b
is the batch size and h
is the dimensionality of
the output of the encoder.
In case a representation for each element of the sequence is needed (for example for tagging them, or for using an
attention mechanism), one can specify the parameter reduce_output
to be null
and the output will be a b x s x h
tensor where s
is the length of the sequence.
Some encoders, because of their inner workings, may require additional parameters to be specified in order to obtain one
representation for each element of the sequence.
For instance the parallel_cnn
encoder by default pools and flattens the sequence dimension and then passes the
flattened vector through fully connected layers, so in order to obtain the full sequence tensor one has to specify
reduce_output: null
.
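For instance, a minimal input feature entry that keeps per-token outputs (the column name is illustrative) could look like:

name: sequence_column_name
type: sequence
encoder:
  type: parallel_cnn
  reduce_output: null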
The encoder parameters specified at the feature level are:
- tied (default: null): name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.
Example sequence feature entry in the input features list:
name: sequence_column_name
type: sequence
tied: null
encoder:
type: stacked_cnn
The available encoder parameters:
- type (default: parallel_cnn): the name of the encoder to use to encode the sequence, one of embed, parallel_cnn, stacked_cnn, stacked_parallel_cnn, rnn, cnnrnn, transformer, and passthrough (equivalent to null or None).
Encoder type and encoder parameters can also be defined once and applied to all sequence input features using the Type-Global Encoder section.
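For example, the following sketch (the encoder choice here is arbitrary) would make every sequence input feature use an rnn encoder unless a feature overrides it:

defaults:
  sequence:
    encoder:
      type: rnn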
Encoders¶
Embed Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["Aggregation\n Reduce\n Operation"];
C --> ...;
The embed encoder simply maps each token in the input sequence to an embedding, creating a b x s x h
tensor where b
is the batch size, s
is the length of the sequence and h
is the embedding size.
The tensor is reduced along the s
dimension to obtain a single vector of size h
for each element of the batch.
If you want to output the full b x s x h
tensor, you can specify reduce_output: null
.
encoder:
type: embed
dropout: 0.0
embedding_size: 256
representation: dense
weights_initializer: uniform
reduce_output: sum
embeddings_on_cpu: false
embeddings_trainable: true
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- weights_initializer (default: uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, a dictionary with a key type identifying the initializer and other keys for its parameters can be given, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the embedding lookup. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows faster access, but in some cases the embedding matrix may be too large; this parameter keeps it in regular memory and uses the CPU for the lookup, slightly slowing down the process due to data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed, which can be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings with labels present in the vocabulary are kept and the others are discarded. Vocabulary strings with no match in the embeddings file are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
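As an example, an embed encoder that loads GloVe-format vectors and keeps them frozen might be configured as follows (the file path is hypothetical):

encoder:
  type: embed
  representation: dense
  pretrained_embeddings: /path/to/glove.6B.256d.txt  # hypothetical path
  embeddings_trainable: false
  reduce_output: mean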
Parallel CNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
C --> D1["1D Conv\n Width 2"] --> E1["Pool"];
C --> D2["1D Conv\n Width 3"] --> E2["Pool"];
C --> D3["1D Conv\n Width 4"] --> E3["Pool"];
C --> D4["1D Conv\n Width 5"] --> E4["Pool"];
E1 --> F["Concat"] --> G["Fully\n Connected\n Layers"] --> H["..."];
E2 --> F;
E3 --> F;
E4 --> F;
The parallel cnn encoder is inspired by
Yoon Kim's Convolutional Neural Network for Sentence Classification.
It works by first mapping the input token sequence b x s
(where b
is the batch size and s
is the length of the
sequence) into a sequence of embeddings, then it passes the embedding through a number of parallel 1d convolutional
layers with different filter sizes (by default 4 layers with filter sizes 2, 3, 4 and 5), followed by max pooling and
concatenation.
This single vector concatenating the outputs of the parallel convolutional layers is then passed through a stack of
fully connected layers and returned as a b x h
tensor where h
is the output size of the last fully connected layer.
If you want to output the full b x s x h
tensor, you can specify reduce_output: null
.
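To make the parallel branches explicit, the filter widths can also be listed per layer through conv_layers, for example with a sketch like the following (mirroring the default widths of 2, 3, 4 and 5):

encoder:
  type: parallel_cnn
  conv_layers:
    - filter_size: 2
    - filter_size: 3
    - filter_size: 4
    - filter_size: 5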
encoder:
type: parallel_cnn
dropout: 0.0
embedding_size: 256
num_conv_layers: null
output_size: 256
activation: relu
filter_size: 3
norm: null
representation: dense
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: sum
conv_layers: null
pool_function: max
pool_size: null
norm_params: null
num_fc_layers: null
fc_layers: null
pretrained_embeddings: null
num_filters: 256
Parameters:
- dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- num_conv_layers (default: null): The number of stacked convolutional layers when conv_layers is null.
- output_size (default: 256): The default output_size that will be used for each layer.
- activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. The same options and dictionary form as weights_initializer apply; for a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, a dictionary with a key type identifying the initializer and other keys for its parameters can be given, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the embedding lookup. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows faster access, but in some cases the embedding matrix may be too large; this parameter keeps it in regular memory and uses the CPU for the lookup, slightly slowing down the process due to data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed, which can be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the encoder-level default is used instead. If both conv_layers and num_conv_layers are null, a default list is assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
- pool_function (default: max): Pooling function to use. max selects the maximum value; any of average, avg, or mean computes the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- pool_size (default: null): The default pool_size used for each layer when one is not already specified in conv_layers. It indicates the size of the max pooling performed along the s sequence dimension after the convolution operation.
- norm_params (default: null): Parameters used if norm is either batch or layer.
- num_fc_layers (default: null): Number of parallel fully connected layers to use.
- fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings with labels present in the vocabulary are kept and the others are discarded. Vocabulary strings with no match in the embeddings file are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
- num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
Stacked CNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["1D Conv Layers\n Different Widths"];
C --> D["Fully\n Connected\n Layers"];
D --> ...;
The stacked cnn encoder is inspired by Xiang Zhang et al.'s Character-level Convolutional Networks for Text Classification.
It works by first mapping the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then it passes the embedding through a stack of 1d convolutional layers with different filter sizes (by default 6 layers with filter sizes 7, 7, 3, 3, 3 and 3), followed by an optional final pool and by a flatten operation.
This single flattened vector is then passed through a stack of fully connected layers and returned as a b x h tensor where h is the output size of the last fully connected layer.
If you want to output the full b x s x h tensor, you can specify the pool_size of all your conv_layers to be null and reduce_output: null. If instead pool_size has a value different from null and reduce_output: null, the returned tensor will be of shape b x s' x h, where s' is the width of the output of the last convolutional layer.
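For instance, a sketch of a configuration that preserves the full sequence dimension (the filter sizes are illustrative) would disable pooling in every layer and skip the final reduction:

encoder:
  type: stacked_cnn
  reduce_output: null
  conv_layers:
    - filter_size: 7
      pool_size: null
    - filter_size: 3
      pool_size: null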
encoder:
type: stacked_cnn
dropout: 0.0
num_conv_layers: null
embedding_size: 256
output_size: 256
activation: relu
filter_size: 3
strides: 1
norm: null
representation: dense
conv_layers: null
pool_function: max
pool_size: null
dilation_rate: 1
pool_strides: null
pool_padding: same
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: sum
norm_params: null
num_fc_layers: null
fc_layers: null
num_filters: 256
padding: same
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- num_conv_layers (default: null): The number of stacked convolutional layers when conv_layers is null.
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- output_size (default: 256): The default output_size that will be used for each layer.
- activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- strides (default: 1): Stride length of the convolution.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the encoder-level default is used instead. If both conv_layers and num_conv_layers are null, a default list is assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
- pool_function (default: max): Pooling function to use. max selects the maximum value; any of average, avg, or mean computes the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- pool_size (default: null): The default pool_size used for each layer when one is not already specified in conv_layers. It indicates the size of the max pooling performed along the s sequence dimension after the convolution operation.
- dilation_rate (default: 1): Dilation rate to use for dilated convolution.
- pool_strides (default: null): Factor to scale down.
- pool_padding (default: same): Padding to use. Options: valid, same.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. The same options and dictionary form as weights_initializer apply; for a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, a dictionary with a key type identifying the initializer and other keys for its parameters can be given, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the embedding lookup. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows faster access, but in some cases the embedding matrix may be too large; this parameter keeps it in regular memory and uses the CPU for the lookup, slightly slowing down the process due to data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed, which can be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- norm_params (default: null): Parameters used if norm is either batch or layer.
- num_fc_layers (default: null): Number of parallel fully connected layers to use.
- fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.
- num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
- padding (default: same): Padding to use. Options: valid, same.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings with labels present in the vocabulary are kept and the others are discarded. Vocabulary strings with no match in the embeddings file are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
Stacked Parallel CNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
C --> D1["1D Conv\n Width 2"] --> E["Concat"];
C --> D2["1D Conv\n Width 3"] --> E;
C --> D3["1D Conv\n Width 4"] --> E;
C --> D4["1D Conv\n Width 5"] --> E;
E --> F["..."];
F --> G1["1D Conv\n Width 2"] --> H["Concat"];
F --> G2["1D Conv\n Width 3"] --> H;
F --> G3["1D Conv\n Width 4"] --> H;
F --> G4["1D Conv\n Width 5"] --> H;
H --> I["Pool"] --> J["Fully\n Connected\n Layers"] --> K["..."];
The stacked parallel cnn encoder is a combination of the Parallel CNN and the Stacked CNN encoders where each layer of
the stack is composed of parallel convolutional layers.
It works by first mapping the input token sequence b x s
(where b
is the batch size and s
is the length of the
sequence) into a sequence of embeddings, then it passes the embedding through a stack of several parallel 1d
convolutional layers with different filter sizes, followed by an optional final pool and by a flatten operation.
This single flattened vector is then passed through a stack of fully connected layers and returned as a b x h
tensor
where h
is the output size of the last fully connected layer.
If you want to output the full b x s x h
tensor, you can specify reduce_output: null
.
encoder:
type: stacked_parallel_cnn
dropout: 0.0
embedding_size: 256
output_size: 256
activation: relu
filter_size: 3
norm: null
representation: dense
num_stacked_layers: null
pool_function: max
pool_size: null
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: sum
norm_params: null
num_fc_layers: null
fc_layers: null
stacked_layers: null
num_filters: 256
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- output_size (default: 256): The default output_size that will be used for each layer.
- activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- num_stacked_layers (default: null): If stacked_layers is null, this is the number of elements in the stack of parallel convolutional layers.
- pool_function (default: max): Pooling function to use. max selects the maximum value; any of average, avg, or mean computes the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- pool_size (default: null): The default pool_size used for each layer when one is not already specified in conv_layers. It indicates the size of the max pooling performed along the s sequence dimension after the convolution operation.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. The same options and dictionary form as weights_initializer apply; for a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, a dictionary with a key type identifying the initializer and other keys for its parameters can be given, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the embedding lookup. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows faster access, but in some cases the embedding matrix may be too large; this parameter keeps it in regular memory and uses the CPU for the lookup, slightly slowing down the process due to data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed, which can be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- norm_params (default: null): Parameters used if norm is either batch or layer.
- num_fc_layers (default: null): Number of parallel fully connected layers to use.
- fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.
- stacked_layers (default: null): A nested list of lists of dictionaries containing the parameters of the stack of parallel convolutional layers. The length of the outer list determines the number of stacked parallel convolutional layers, the length of the sub-lists determines the number of parallel conv layers, and the content of each dictionary determines the parameters for a specific layer.
- num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings with labels present in the vocabulary are kept and the others are discarded. Vocabulary strings with no match in the embeddings file are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
RNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["RNN Layers"];
C --> D["Fully\n Connected\n Layers"];
D --> ...;
The rnn encoder works by first mapping the input token sequence b x s
(where b
is the batch size and s
is the
length of the sequence) into a sequence of embeddings, then it passes the embedding through a stack of recurrent layers
(by default 1 layer), followed by a reduce operation that by default only returns the last output, but can perform other
reduce functions.
If you want to output the full b x s x h tensor, where h is the size of the output of the last rnn layer, you can specify reduce_output: null.
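For example, a bidirectional LSTM encoder that keeps a single vector per batch element could be sketched as follows (the values are illustrative):

encoder:
  type: rnn
  cell_type: lstm
  num_layers: 2
  bidirectional: true
  reduce_output: last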
encoder:
type: rnn
dropout: 0.0
cell_type: rnn
num_layers: 1
state_size: 256
embedding_size: 256
output_size: 256
norm: null
num_fc_layers: 0
fc_dropout: 0.0
recurrent_dropout: 0.0
activation: tanh
fc_activation: relu
recurrent_activation: sigmoid
representation: dense
unit_forget_bias: true
recurrent_initializer: orthogonal
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: last
norm_params: null
fc_layers: null
bidirectional: false
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- cell_type (default: rnn): The type of recurrent cell to use. Available values are: rnn, lstm, gru. For reference about the differences between the cells please refer to torch.nn Recurrent Layers. Options: rnn, lstm, gru.
- num_layers (default: 1): The number of stacked recurrent layers.
- state_size (default: 256): The size of the state of the rnn.
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- output_size (default: 256): The default output_size that will be used for each layer.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, ghost, null.
- num_fc_layers (default: 0): Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- recurrent_dropout (default: 0.0): The dropout rate for the recurrent state.
- activation (default: tanh): The default activation function. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- recurrent_activation (default: sigmoid): The activation function to use in the recurrent step. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- unit_forget_bias (default: true): If true, add 1 to the bias of the forget gate at initialization. Options: true, false.
- recurrent_initializer (default: orthogonal): The initializer for recurrent matrix weights. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. The same options and dictionary form as weights_initializer apply; for a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, a dictionary with a key type identifying the initializer and other keys for its parameters can be given, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the embedding lookup. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows faster access, but in some cases the embedding matrix may be too large; this parameter keeps it in regular memory and uses the CPU for the lookup, slightly slowing down the process due to data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed, which can be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: last): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- norm_params (default: null): Default parameters passed to the norm module.
- fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- bidirectional (default: false): If true, two recurrent networks will perform encoding in the forward and backward direction and their outputs will be concatenated. Options: true, false.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings with labels present in the vocabulary are kept and the others are discarded. Vocabulary strings with no match in the embeddings file are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
CNN RNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C1["CNN Layers"];
C1 --> C2["RNN Layers"];
C2 --> D["Fully\n Connected\n Layers"];
D --> ...;
The cnnrnn
encoder works by first mapping the input token sequence b x s
(where b
is the batch size and s
is
the length of the sequence) into a sequence of embeddings, then it passes the embedding through a stack of convolutional
layers (by default 2), that is followed by a stack of recurrent layers (by default 1), followed by a reduce operation
that by default only returns the last output, but can perform other reduce functions.
If you want to output the full b x s x h tensor, where h is the size of the output of the last rnn layer, you can specify reduce_output: null.
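For example, an illustrative cnnrnn configuration with two convolutional layers followed by a stacked GRU could look like:

encoder:
  type: cnnrnn
  num_conv_layers: 2
  cell_type: gru
  num_rec_layers: 2
  reduce_output: last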
encoder:
type: cnnrnn
dropout: 0.0
conv_dropout: 0.0
cell_type: rnn
num_conv_layers: null
state_size: 256
embedding_size: 256
output_size: 256
norm: null
num_fc_layers: 0
fc_dropout: 0.0
recurrent_dropout: 0.0
activation: tanh
filter_size: 5
strides: 1
fc_activation: relu
recurrent_activation: sigmoid
conv_activation: relu
representation: dense
conv_layers: null
pool_function: max
pool_size: null
dilation_rate: 1
pool_strides: null
pool_padding: same
unit_forget_bias: true
recurrent_initializer: orthogonal
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: last
norm_params: null
fc_layers: null
num_filters: 256
padding: same
num_rec_layers: 1
bidirectional: false
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- conv_dropout (default: 0.0): The dropout rate for the convolutional layers.
- cell_type (default: rnn): The type of recurrent cell to use. Available values are: rnn, lstm, gru. For reference about the differences between the cells please refer to torch.nn Recurrent Layers. Options: rnn, lstm, gru.
- num_conv_layers (default: null): The number of stacked convolutional layers when conv_layers is null.
- state_size (default: 256): The size of the state of the rnn.
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- output_size (default: 256): The default output_size that will be used for each layer.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, ghost, null.
- num_fc_layers (default: 0): Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- recurrent_dropout (default: 0.0): The dropout rate for the recurrent state.
- activation (default: tanh): The default activation function to use. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- filter_size (default: 5): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- strides (default: 1): Stride length of the convolution.
- fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- recurrent_activation (default: sigmoid): The activation function to use in the recurrent step. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- conv_activation (default: relu): The default activation function that will be used for each convolutional layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the encoder-level default is used instead. If both conv_layers and num_conv_layers are null, a default list is assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
- pool_function (default: max): Pooling function to use. max selects the maximum value; any of average, avg, or mean computes the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- pool_size (default: null): The default pool_size used for each layer when one is not already specified in conv_layers. It indicates the size of the max pooling performed along the s sequence dimension after the convolution operation.
- dilation_rate (default: 1): Dilation rate to use for dilated convolution.
- pool_strides (default: null): Factor to scale down.
- pool_padding (default: same): Padding to use. Options: valid, same.
- unit_forget_bias (default: true): If true, add 1 to the bias of the forget gate at initialization. Options: true, false.
- recurrent_initializer (default: orthogonal): The initializer for recurrent matrix weights. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. The same options and dictionary form as weights_initializer apply; for a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, a dictionary with a key type identifying the initializer and other keys for its parameters can be given, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the embedding lookup. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows faster access, but in some cases the embedding matrix may be too large; this parameter keeps it in regular memory and uses the CPU for the lookup, slightly slowing down the process due to data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed, which can be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: last): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- norm_params (default: null): Default parameters passed to the norm module.
- fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
- padding (default: same): Padding to use. Options: valid, same.
- num_rec_layers (default: 1): The number of stacked recurrent layers.
- bidirectional (default: false): If true, two recurrent networks will perform encoding in the forward and backward direction and their outputs will be concatenated. Options: true, false.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings with labels present in the vocabulary are kept and the others are discarded. Vocabulary strings with no match in the embeddings file are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
Transformer Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["Transformer\n Blocks"];
C --> D["Fully\n Connected\n Layers"];
D --> ...;
The transformer encoder implements a stack of transformer blocks, replicating the architecture introduced in the Attention is all you need paper, and adds an optional stack of fully connected layers at the end.
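An illustrative transformer configuration with a deeper stack and a final fully connected projection might look like the following (the values are only an example):

encoder:
  type: transformer
  num_layers: 4
  num_heads: 8
  hidden_size: 256
  num_fc_layers: 1
  output_size: 128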
encoder:
type: transformer
dropout: 0.1
num_layers: 1
embedding_size: 256
output_size: 256
norm: null
num_fc_layers: 0
fc_dropout: 0.0
hidden_size: 256
transformer_output_size: 256
fc_activation: relu
representation: dense
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: last
norm_params: null
fc_layers: null
num_heads: 8
pretrained_embeddings: null
Parameters:
dropout
(default:0.1
) : The dropout rate for the transformer block. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).num_layers
(default:1
) : The number of transformer layers.embedding_size
(default:256
) : The maximum embedding size. The actual size will bemin(vocabulary_size, embedding_size)
fordense
representations and exactlyvocabulary_size
for thesparse
encoding, wherevocabulary_size
is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>
,<PAD>
,<SOS>
,<EOS>
).output_size
(default:256
) : The default output_size that will be used for each layer.norm
(default:null
) : The default norm that will be used for each layer. Options:batch
,layer
,ghost
,null
.num_fc_layers
(default:0
) : Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.fc_dropout
(default:0.0
) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).hidden_size
(default:256
): The size of the hidden representation within the transformer block. It is usually the same as the embedding_size, but if the two values are different, a projection layer will be added before the first transformer block.transformer_output_size
(default:256
): Size of the fully connected layer after self attention in the transformer block. This is usually the same as hidden_size and embedding_size.fc_activation
(default:relu
): Default activation function applied to the output of the fully connected layers. Options:elu
,leakyRelu
,logSigmoid
,relu
,sigmoid
,tanh
,softmax
,null
.representation
(default:dense
): Representation of the embedding.dense
means the embeddings are initialized randomly,sparse
means they are initialized to be one-hot encodings. Options: `dense`, `sparse`.
- `use_bias` (default: `true`): Whether to use a bias vector. Options: `true`, `false`.
- `bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `weights_initializer` (default: `xavier_uniform`): Initializer for the weight matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `embeddings_on_cpu` (default: `false`): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve embedding lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and uses the CPU for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: `true`, `false`.
- `embeddings_trainable` (default: `true`): If `true` embeddings are trained during the training process, if `false` embeddings are fixed. Fixing embeddings may be useful when loading pretrained embeddings, to avoid fine-tuning them. This parameter has effect only when `representation` is `dense`; `sparse` one-hot encodings are not trainable. Options: `true`, `false`.
- `reduce_output` (default: `last`): How to reduce the output tensor along the `s` sequence length dimension if the rank of the tensor is greater than 2. Options: `last`, `sum`, `mean`, `avg`, `max`, `concat`, `attention`, `none`, `None`, `null`.
- `norm_params` (default: `null`): Default parameters passed to the `norm` module.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- `num_heads` (default: `8`): Number of attention heads in each transformer block.
- `pretrained_embeddings` (default: `null`): Path to a file containing pretrained embeddings. By default `dense` embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the embeddings file is loaded, only the embeddings with labels present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if `representation` is `dense`. See the sketch after this list for an example.
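For illustration, here is a minimal sketch of a sequence input feature that loads pretrained GloVe vectors and keeps them frozen, assuming the `rnn` sequence encoder (which accepts these embedding parameters). The column name, file path, and embedding size are hypothetical placeholders, not values prescribed by Ludwig:

```yaml
input_features:
  - name: seq_column_name        # hypothetical column name
    type: sequence
    encoder:
      type: rnn
      representation: dense      # pretrained embeddings require the dense representation
      embedding_size: 300        # should match the dimensionality of the pretrained vectors
      pretrained_embeddings: /path/to/glove.6B.300d.txt  # hypothetical path, GloVe format
      embeddings_trainable: false  # keep the pretrained vectors fixed during training
      reduce_output: last        # reduce along the sequence dimension
```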
Output Features¶
Sequence output features can be used for either tagging (classifying each element of an input sequence) or generation (generating a sequence by sampling from the model). Ludwig provides two sequence decoders, named `tagger` and `generator` respectively.
Example sequence output feature using default parameters:
name: seq_column_name
type: sequence
reduce_input: null
dependencies: []
reduce_dependencies: sum
loss:
  type: sequence_softmax_cross_entropy
  confidence_penalty: 0
  robust_lambda: 0
  class_weights: 1
  class_similarities_temperature: 0
decoder:
  type: generator
Parameters:
- `reduce_input` (default `sum`): defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: `sum`, `mean` or `avg`, `max`, `concat` (concatenates along the sequence dimension), `last` (returns the last vector of the sequence dimension).
- `dependencies` (default `[]`): the output features this one is dependent on. For a detailed explanation refer to Output Feature Dependencies. A sketch follows this list.
- `reduce_dependencies` (default `sum`): defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: `sum`, `mean` or `avg`, `max`, `concat` (concatenates along the sequence dimension), `last` (returns the last vector of the sequence dimension).
- `loss` (default `{type: sequence_softmax_cross_entropy, class_similarities_temperature: 0, class_weights: 1, confidence_penalty: 0, robust_lambda: 0}`): a dictionary containing a loss `type`. The only available loss `type` for sequences is `sequence_softmax_cross_entropy`. See Loss for details.
- `decoder` (default: `{"type": "generator"}`): Decoder for the desired task. Options: `generator`, `tagger`. See Decoder for details.
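As a concrete illustration of `dependencies` and `reduce_dependencies`, the sketch below makes a sequence output feature depend on a category output feature. The feature names are hypothetical and only show the shape of the configuration:

```yaml
output_features:
  - name: intent             # hypothetical category output feature
    type: category
  - name: slots              # hypothetical sequence output feature
    type: sequence
    dependencies: [intent]   # the representation of `intent` is fed into this feature
    reduce_dependencies: sum # how to collapse the dependent feature's hidden tensor
    decoder:
      type: tagger
```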
Decoder type and decoder parameters can also be defined once and applied to all sequence output features using the Type-Global Decoder section. Loss and loss-related parameters can also be defined once in the same way, as sketched below.
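A minimal sketch of defining the decoder and loss once for every sequence output feature, assuming the type-global `defaults` section described in the linked documentation; the parameter values are illustrative only:

```yaml
defaults:
  sequence:
    decoder:
      type: tagger
    loss:
      type: sequence_softmax_cross_entropy
      confidence_penalty: 0.1   # illustrative value, not a recommended default
```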
Decoders¶
Generator¶
graph LR
A["Combiner Output"] --> B["Fully\n Connected\n Layers"];
B --> C1["RNN"] --> C2["RNN"] --> C3["RNN"];
GO(["GO"]) -.-o C1;
C1 -.-o O1("Output");
O1 -.-o C2;
C2 -.-o O2("Output");
O2 -.-o C3;
C3 -.-o END(["END"]);
subgraph DEC["DECODER.."]
B
C1
C2
C3
end
In the case of `generator` the decoder is a (potentially empty) stack of fully connected layers, followed by an RNN that generates outputs by feeding on its own previous predictions, producing a tensor of size `b x s' x c`, where `b` is the batch size, `s'` is the length of the generated sequence and `c` is the number of classes, followed by a softmax_cross_entropy loss.
During training, teacher forcing is adopted, meaning the list of targets is provided as both inputs and outputs (shifted by 1). At evaluation time, decoding is performed by beam search with a beam size of 1 by default, which is equivalent to greedy decoding (generating one token at a time and feeding it as input for the next step).
In general a generator expects a `b x h` shaped input tensor, where `h` is a hidden dimension.
The `h` vectors are (after an optional stack of fully connected layers) fed into the RNN generator.
One exception is when the generator uses attention: in that case the expected size of the input tensor is `b x s x h`, which is the output of a sequence, text or time series input feature without reduced outputs, or the output of a sequence-based combiner.
If a `b x h` input is provided to a generator decoder using an RNN with attention instead, an error will be raised during model building.
decoder:
  type: generator
  num_fc_layers: 0
  fc_output_size: 256
  fc_norm: null
  fc_dropout: 0.0
  cell_type: gru
  num_layers: 1
  fc_activation: relu
  reduce_input: sum
  fc_layers: null
  fc_use_bias: true
  fc_weights_initializer: xavier_uniform
  fc_bias_initializer: zeros
  fc_norm_params: null
Parameters:
- `num_fc_layers` (default: `0`): Number of fully-connected layers if `fc_layers` is not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- `fc_output_size` (default: `256`): Output size of the fully connected stack.
- `fc_norm` (default: `null`): Default normalization applied at the beginning of fully connected layers. Options: `batch`, `layer`, `ghost`, `null`.
- `fc_dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `cell_type` (default: `gru`): Type of recurrent cell to use. Options: `rnn`, `lstm`, `gru`. See the sketch after this list for an example of overriding it.
- `num_layers` (default: `1`): The number of stacked recurrent layers.
- `fc_activation` (default: `relu`): Default activation function applied to the output of the fully connected layers. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `reduce_input` (default: `sum`): How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Options: `sum`, `mean`, `avg`, `max`, `concat`, `last`.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- `fc_use_bias` (default: `true`): Whether the layer uses a bias vector in the fc_stack. Options: `true`, `false`.
- `fc_weights_initializer` (default: `xavier_uniform`): The weights initializer to use for the layers in the fc_stack.
- `fc_bias_initializer` (default: `zeros`): The bias initializer to use for the layers in the fc_stack.
- `fc_norm_params` (default: `null`): Default parameters passed to the `norm` module.
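For example, a minimal sketch of overriding the generator decoder to use a two-layer LSTM with a small fully connected stack; the output column name and parameter values are illustrative, not defaults:

```yaml
output_features:
  - name: target_sequence    # hypothetical output column
    type: sequence
    decoder:
      type: generator
      cell_type: lstm        # use LSTM cells instead of the default GRU
      num_layers: 2          # stack two recurrent layers
      num_fc_layers: 1       # one fully connected layer before the RNN
      fc_output_size: 128
```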
Tagger¶
graph LR
A["emb[0]\n....\nemb[n]"] --> B["Fully\n Connected\n Layers"];
B --> C["Projection\n....\nProjection"];
C --> D["Softmax\n....\nSoftmax"];
subgraph DEC["DECODER.."]
B
C
D
end
subgraph COM["COMBINER OUT.."]
A
end
In the case of `tagger` the decoder is a (potentially empty) stack of fully connected layers, followed by a projection into a tensor of size `b x s x c`, where `b` is the batch size, `s` is the length of the sequence and `c` is the number of classes, followed by a softmax_cross_entropy loss.
This decoder requires its input to be shaped as `b x s x h`, where `h` is a hidden dimension, which is the output of a sequence, text or time series input feature without reduced outputs, or the output of a sequence-based combiner.
If a `b x h` input is provided instead, an error will be raised during model building.
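For tagging, a typical setup (sketched below with hypothetical column names) keeps the input sequence unreduced so that the decoder receives the full `b x s x h` tensor:

```yaml
input_features:
  - name: tokens             # hypothetical input column
    type: sequence
    encoder:
      type: rnn
      reduce_output: null    # keep the b x s x h tensor for the tagger
output_features:
  - name: tags               # hypothetical per-token labels
    type: sequence
    decoder:
      type: tagger
```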
decoder:
  type: tagger
  num_fc_layers: 0
  fc_output_size: 256
  fc_norm: null
  fc_dropout: 0.0
  fc_activation: relu
  attention_embedding_size: 256
  fc_layers: null
  fc_use_bias: true
  fc_weights_initializer: xavier_uniform
  fc_bias_initializer: zeros
  fc_norm_params: null
  use_attention: false
  use_bias: true
  attention_num_heads: 8
Parameters:
- `num_fc_layers` (default: `0`): Number of fully-connected layers if `fc_layers` is not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- `fc_output_size` (default: `256`): Output size of the fully connected stack.
- `fc_norm` (default: `null`): Default normalization applied at the beginning of fully connected layers. Options: `batch`, `layer`, `ghost`, `null`.
- `fc_dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `fc_activation` (default: `relu`): Default activation function applied to the output of the fully connected layers. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `attention_embedding_size` (default: `256`): The embedding size of the multi-head self attention layer.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- `fc_use_bias` (default: `true`): Whether the layer uses a bias vector in the fc_stack. Options: `true`, `false`.
- `fc_weights_initializer` (default: `xavier_uniform`): The weights initializer to use for the layers in the fc_stack.
- `fc_bias_initializer` (default: `zeros`): The bias initializer to use for the layers in the fc_stack.
- `fc_norm_params` (default: `null`): Default parameters passed to the `norm` module.
- `use_attention` (default: `false`): Whether to apply a multi-head self attention layer before prediction. Options: `true`, `false`. See the sketch after this list.
- `use_bias` (default: `true`): Whether the layer uses a bias vector. Options: `true`, `false`.
- `attention_num_heads` (default: `8`): The number of attention heads in the multi-head self attention layer.
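A minimal sketch of enabling self attention in the tagger; the output column name and the attention sizes are illustrative values, not defaults:

```yaml
output_features:
  - name: tags                     # hypothetical output column
    type: sequence
    decoder:
      type: tagger
      use_attention: true          # add multi-head self attention before prediction
      attention_num_heads: 4       # illustrative value, default is 8
      attention_embedding_size: 128
```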
Loss¶
Sequence Softmax Cross Entropy¶
loss:
  type: sequence_softmax_cross_entropy
  class_weights: null
  weight: 1.0
  robust_lambda: 0
  confidence_penalty: 0
  class_similarities: null
  class_similarities_temperature: 0
  unique: false
Parameters:
- `class_weights` (default: `null`): Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied by the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the `<UNK>` class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like `{class_a: 0.5, class_b: 0.7, ...}`. See the sketch after this list.
- `weight` (default: `1.0`): Weight of the loss.
- `robust_lambda` (default: `0`): Replaces the loss with `(1 - robust_lambda) * loss + robust_lambda / c` where `c` is the number of classes. Useful in case of noisy labels.
- `confidence_penalty` (default: `0`): Penalizes overconfident predictions (low entropy) by adding an `a * (max_entropy - entropy) / max_entropy` term to the loss, where `a` is the value of this parameter. Useful in case of noisy labels.
- `class_similarities` (default: `null`): If not `null` it is a `c x c` matrix in the form of a list of lists that contains the mutual similarity of classes. It is used if `class_similarities_temperature` is greater than 0. The ordering follows the category to integer ID mapping in the JSON metadata file (the `<UNK>` class needs to be included too).
- `class_similarities_temperature` (default: `0`): The temperature parameter of the softmax that is performed on each row of `class_similarities`. The output of that softmax is used to determine the supervision vector to provide instead of the one-hot vector that would otherwise be provided for each datapoint. The intuition behind it is that errors between similar classes are more tolerable than errors between really different classes.
- `unique` (default: `false`): If true, the loss is only computed for unique elements in the sequence. Options: `true`, `false`.
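For instance, a minimal sketch of the dictionary form of `class_weights`, upweighting rarer tags; the class names and weights are hypothetical:

```yaml
loss:
  type: sequence_softmax_cross_entropy
  class_weights:        # hypothetical tag names; rarer tags get larger weights
    class_a: 0.5
    class_b: 0.7
    class_c: 2.0
```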
Metrics¶
The metrics that are calculated every epoch and are available for sequence features are:
- `sequence_accuracy`: The rate at which the model predicted the correct sequence.
- `token_accuracy`: The number of tokens correctly predicted divided by the total number of tokens in all sequences.
- `last_accuracy`: Accuracy considering only the last element of the sequence. Useful to ensure special end-of-sequence tokens are generated or tagged.
- `edit_distance`: Levenshtein distance: the minimum number of single-token edits (insertions, deletions or substitutions) required to change the predicted sequence into the ground truth.
- `perplexity`: The inverse of the predicted probability of the ground truth sequence, normalized by the number of tokens. The lower the perplexity, the higher the probability of predicting the true sequence.
- `loss`: The value of the loss function.
You can set any of the above as `validation_metric` in the `training` section of the configuration if `validation_field` names a sequence feature, as sketched below.
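A minimal sketch of this setup; the output column name is hypothetical, and note that in recent Ludwig versions this top-level section is named `trainer` rather than `training`:

```yaml
trainer:                            # named `training` in older Ludwig versions
  validation_field: tags            # hypothetical sequence output feature
  validation_metric: token_accuracy # any of the metrics listed above
```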