Text Features
Preprocessing¶
Text features are an extension of sequence features. Text inputs are processed by a tokenizer which maps the raw text input into a sequence of tokens. An integer id is assigned to each unique token. Using this mapping, each text string is converted first to a sequence of tokens, and next to a sequence of integers.
The list of tokens and their integer representations (vocabulary) is stored in the metadata of the model. In the case of a text output feature, this same mapping is used to post-process predictions to text.
preprocessing:
    tokenizer: space_punct
    max_sequence_length: 256
    missing_value_strategy: fill_with_const
    most_common: 20000
    lowercase: false
    fill_value: <UNK>
    ngram_size: 2
    padding_symbol: <PAD>
    unknown_symbol: <UNK>
    padding: right
    cache_encoder_embeddings: false
    vocab_file: null
    sequence_length: null
    prompt:
        template: null
        task: null
        retrieval:
            type: null
            index_name: null
            model_name: null
            k: 0
Parameters:
- tokenizer (default: space_punct): Defines how to map from the raw string content of the dataset column to a sequence of elements. Options: space, space_punct, ngram, characters, underscore, comma, untokenized, stripped, english_tokenize, english_tokenize_filter, english_tokenize_remove_stopwords, english_lemmatize, english_lemmatize_filter, english_lemmatize_remove_stopwords, italian_tokenize, italian_tokenize_filter, italian_tokenize_remove_stopwords, italian_lemmatize, italian_lemmatize_filter, italian_lemmatize_remove_stopwords, spanish_tokenize, spanish_tokenize_filter, spanish_tokenize_remove_stopwords, spanish_lemmatize, spanish_lemmatize_filter, spanish_lemmatize_remove_stopwords, german_tokenize, german_tokenize_filter, german_tokenize_remove_stopwords, german_lemmatize, german_lemmatize_filter, german_lemmatize_remove_stopwords, french_tokenize, french_tokenize_filter, french_tokenize_remove_stopwords, french_lemmatize, french_lemmatize_filter, french_lemmatize_remove_stopwords, portuguese_tokenize, portuguese_tokenize_filter, portuguese_tokenize_remove_stopwords, portuguese_lemmatize, portuguese_lemmatize_filter, portuguese_lemmatize_remove_stopwords, dutch_tokenize, dutch_tokenize_filter, dutch_tokenize_remove_stopwords, dutch_lemmatize, dutch_lemmatize_filter, dutch_lemmatize_remove_stopwords, greek_tokenize, greek_tokenize_filter, greek_tokenize_remove_stopwords, greek_lemmatize, greek_lemmatize_filter, greek_lemmatize_remove_stopwords, norwegian_tokenize, norwegian_tokenize_filter, norwegian_tokenize_remove_stopwords, norwegian_lemmatize, norwegian_lemmatize_filter, norwegian_lemmatize_remove_stopwords, lithuanian_tokenize, lithuanian_tokenize_filter, lithuanian_tokenize_remove_stopwords, lithuanian_lemmatize, lithuanian_lemmatize_filter, lithuanian_lemmatize_remove_stopwords, danish_tokenize, danish_tokenize_filter, danish_tokenize_remove_stopwords, danish_lemmatize, danish_lemmatize_filter, danish_lemmatize_remove_stopwords, polish_tokenize, polish_tokenize_filter, polish_tokenize_remove_stopwords, polish_lemmatize, polish_lemmatize_filter, polish_lemmatize_remove_stopwords, romanian_tokenize, romanian_tokenize_filter, romanian_tokenize_remove_stopwords, romanian_lemmatize, romanian_lemmatize_filter, romanian_lemmatize_remove_stopwords, japanese_tokenize, japanese_tokenize_filter, japanese_tokenize_remove_stopwords, japanese_lemmatize, japanese_lemmatize_filter, japanese_lemmatize_remove_stopwords, chinese_tokenize, chinese_tokenize_filter, chinese_tokenize_remove_stopwords, chinese_lemmatize, chinese_lemmatize_filter, chinese_lemmatize_remove_stopwords, multi_tokenize, multi_tokenize_filter, multi_tokenize_remove_stopwords, multi_lemmatize, multi_lemmatize_filter, multi_lemmatize_remove_stopwords, sentencepiece, clip, gpt2bpe, bert, hf_tokenizer.
- max_sequence_length (default: 256): The maximum length (number of tokens) of the sequence. Sequences longer than this value will be truncated. Useful as a stopgap measure if sequence_length is set to None. If None, the max sequence length will be inferred from the training dataset.
- missing_value_strategy (default: fill_with_const): What strategy to follow when there's a missing value in a text column. Options: fill_with_const, fill_with_mode, bfill, ffill, drop_row. See Missing Value Strategy for details.
- most_common (default: 20000): The maximum number of most common tokens in the vocabulary. If the data contains more than this amount, the most infrequent symbols will be treated as unknown.
- lowercase (default: false): If true, converts the string to lowercase before tokenizing. Options: true, false.
- fill_value (default: <UNK>): The value to replace missing values with when missing_value_strategy is fill_with_const.
- ngram_size (default: 2): The size of the ngram when using the ngram tokenizer (e.g., 2 = bigram, 3 = trigram, etc.).
- padding_symbol (default: <PAD>): The string used as the padding symbol for sequence features. Ignored for features using huggingface encoders, which have their own vocabulary.
- unknown_symbol (default: <UNK>): The string used as the unknown symbol for sequence features. Ignored for features using huggingface encoders, which have their own vocabulary.
- padding (default: right): The direction of the padding. Options: left, right.
- cache_encoder_embeddings (default: false): For pretrained encoders, compute encoder embeddings in preprocessing, speeding up training time considerably. Only supported when encoder.trainable=false. Options: true, false.
- vocab_file (default: null): Filepath string to a UTF-8 encoded file containing the sequence's vocabulary. On each line, the first string until \t or \n is considered a word.
- sequence_length (default: null): The desired length (number of tokens) of the sequence. Sequences longer than this value will be truncated and sequences shorter than this value will be padded. If None, the sequence length will be inferred from the training dataset.
- prompt:
    - prompt.template (default: null): The template to use for the prompt. Must contain at least one of the columns from the input dataset or __sample__ as a variable surrounded in curly brackets {} to indicate where to insert the current feature. Multiple columns can be inserted, e.g.: The {color} {animal} jumped over the {size} {object}, where every term in curly brackets is a column in the dataset. If a task is specified, then the template must also contain the __task__ variable. If retrieval is specified, then the template must also contain the __context__ variable. If no template is provided, a default will be used based on the retrieval settings, and a task must be set in the config.
    - prompt.task (default: null): The task to use for the prompt. Required if template is not set.
    - prompt.retrieval (default: {"type": null})
Preprocessing parameters can also be defined once and applied to all text input features using the Type-Global Preprocessing section.
Note
If a text feature's encoder specifies a huggingface model, then the tokenizer for that model will be used automatically.
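For example, a minimal sketch of overriding a few of these preprocessing parameters for a single text input feature might look like the following (the column name review_text is hypothetical):
input_features:
    - name: review_text            # hypothetical column name
      type: text
      preprocessing:
          tokenizer: space         # split on whitespace only
          lowercase: true          # normalize case before building the vocabulary
          max_sequence_length: 128 # truncate longer inputs
          most_common: 10000       # cap the vocabulary size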
Input Features¶
The encoder parameters specified at the feature level are:
- tied (default: null): Name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.
Example text feature entry in the input features list:
name: text_column_name
type: text
tied: null
encoder:
    type: bert
    trainable: true
Parameters:
- type (default: parallel_cnn): Encoder to use for the input text feature. The available encoders include the encoders used for Sequence Features as well as pretrained text encoders from the Hugging Face transformers library: albert, auto_transformer, bert, camembert, ctrl, distilbert, electra, flaubert, gpt, gpt2, longformer, roberta, t5, mt5, transformer_xl, xlm, xlmroberta, xlnet.
Encoder type and encoder parameters can also be defined once and applied to all text input features using the Type-Global Encoder section.
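As an illustrative sketch of the tied parameter (both column names are hypothetical), two text features of the same type and with the same encoder configuration can share encoder weights:
input_features:
    - name: question          # hypothetical column name
      type: text
      encoder:
          type: rnn
    - name: answer            # hypothetical column name
      type: text
      tied: question          # reuse the weights of the "question" feature's encoder
      encoder:
          type: rnn           # must match the tied feature's encoder type and parameters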
Encoders¶
Embed Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["Aggregation\n Reduce\n Operation"];
C --> ...;
The embed encoder simply maps each token in the input sequence to an embedding, creating a b x s x h
tensor where b
is the batch size, s
is the length of the sequence and h
is the embedding size.
The tensor is reduced along the s
dimension to obtain a single vector of size h
for each element of the batch.
If you want to output the full b x s x h
tensor, you can specify reduce_output: null
.
encoder:
type: embed
dropout: 0.0
embedding_size: 256
representation: dense
weights_initializer: uniform
reduce_output: sum
embeddings_on_cpu: false
embeddings_trainable: true
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- weights_initializer (default: uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
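For example, a minimal sketch of an embed encoder that returns the full b x s x h tensor instead of a reduced vector (the feature name is hypothetical):
name: utterance             # hypothetical column name
type: text
encoder:
    type: embed
    embedding_size: 128
    dropout: 0.1
    reduce_output: null     # keep the full b x s x h tensor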
Parallel CNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
C --> D1["1D Conv\n Width 2"] --> E1["Pool"];
C --> D2["1D Conv\n Width 3"] --> E2["Pool"];
C --> D3["1D Conv\n Width 4"] --> E3["Pool"];
C --> D4["1D Conv\n Width 5"] --> E4["Pool"];
E1 --> F["Concat"] --> G["Fully\n Connected\n Layers"] --> H["..."];
E2 --> F;
E3 --> F;
E4 --> F;
The parallel cnn encoder is inspired by
Yoon Kim's Convolutional Neural Network for Sentence Classification.
It works by first mapping the input token sequence b x s
(where b
is the batch size and s
is the length of the
sequence) into a sequence of embeddings, then it passes the embedding through a number of parallel 1d convolutional
layers with different filter size (by default 4 layers with filter size 2, 3, 4 and 5), followed by max pooling and
concatenation.
This single vector concatenating the outputs of the parallel convolutional layers is then passed through a stack of
fully connected layers and returned as a b x h
tensor where h
is the output size of the last fully connected layer.
If you want to output the full b x s x h
tensor, you can specify reduce_output: null
.
encoder:
type: parallel_cnn
dropout: 0.0
embedding_size: 256
num_conv_layers: null
output_size: 256
activation: relu
filter_size: 3
norm: null
representation: dense
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: sum
conv_layers: null
pool_function: max
pool_size: null
norm_params: null
num_fc_layers: null
fc_layers: null
pretrained_embeddings: null
num_filters: 256
Parameters:
- dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- num_conv_layers (default: null): The number of stacked convolutional layers when conv_layers is null.
- output_size (default: 256): The default output_size that will be used for each layer.
- activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both conv_layers and num_conv_layers are null, a default list will be assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
- pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers, this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
- norm_params (default: null): Parameters used if norm is either batch or layer.
- num_fc_layers (default: null): Number of parallel fully connected layers to use.
- fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
- num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
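For instance, a sketch of a parallel_cnn encoder that overrides a few of the defaults above (the feature name is hypothetical):
name: title                 # hypothetical column name
type: text
encoder:
    type: parallel_cnn
    num_filters: 128        # output channels of each 1d convolution
    output_size: 128        # output size of the fully connected layers
    num_fc_layers: 1
    dropout: 0.1
    reduce_output: sum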
Stacked CNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["1D Conv Layers\n Different Widths"];
C --> D["Fully\n Connected\n Layers"];
D --> ...;
The stacked cnn encoder is inspired by Xiang Zhang et al.'s Character-level Convolutional Networks for Text Classification.
It works by first mapping the input token sequence b x s
(where b
is the batch size and s
is the length of the
sequence) into a sequence of embeddings, then it passes the embedding through a stack of 1d convolutional layers with
different filter size (by default 6 layers with filter size 7, 7, 3, 3, 3 and 3), followed by an optional final pool and
by a flatten operation.
This single flatten vector is then passed through a stack of fully connected layers and returned as a b x h
tensor
where h
is the output size of the last fully connected layer.
If you want to output the full b x s x h
tensor, you can specify the pool_size
of all your conv_layers
to be
null
and reduce_output: null
, while if pool_size
has a value different from null
and reduce_output: null
the
returned tensor will be of shape b x s' x h
, where s'
is the width of the output of the last convolutional layer.
encoder:
type: stacked_cnn
dropout: 0.0
num_conv_layers: null
embedding_size: 256
output_size: 256
activation: relu
filter_size: 3
strides: 1
norm: null
representation: dense
conv_layers: null
pool_function: max
pool_size: null
dilation_rate: 1
pool_strides: null
pool_padding: same
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: sum
norm_params: null
num_fc_layers: null
fc_layers: null
num_filters: 256
padding: same
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- num_conv_layers (default: null): The number of stacked convolutional layers when conv_layers is null.
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- output_size (default: 256): The default output_size that will be used for each layer.
- activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- strides (default: 1): Stride length of the convolution.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both conv_layers and num_conv_layers are null, a default list will be assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
- pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers, this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
- dilation_rate (default: 1): Dilation rate to use for dilated convolution.
- pool_strides (default: null): Factor to scale down.
- pool_padding (default: same): Padding to use. Options: valid, same.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- norm_params (default: null): Parameters used if norm is either batch or layer.
- num_fc_layers (default: null): Number of parallel fully connected layers to use.
- fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.
- num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
- padding (default: same): Padding to use. Options: valid, same.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
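As a sketch of the behavior described above (the feature name is hypothetical), setting pool_size to null in every conv layer together with reduce_output: null keeps the full b x s x h tensor:
name: document              # hypothetical column name
type: text
encoder:
    type: stacked_cnn
    conv_layers:
        - filter_size: 7
          pool_size: null   # no pooling, so the sequence dimension is preserved
        - filter_size: 3
          pool_size: null
    reduce_output: null     # no final reduction, output stays b x s x h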
Stacked Parallel CNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
C --> D1["1D Conv\n Width 2"] --> E["Concat"];
C --> D2["1D Conv\n Width 3"] --> E;
C --> D3["1D Conv\n Width 4"] --> E;
C --> D4["1D Conv\n Width 5"] --> E;
E --> F["..."];
F --> G1["1D Conv\n Width 2"] --> H["Concat"];
F --> G2["1D Conv\n Width 3"] --> H;
F --> G3["1D Conv\n Width 4"] --> H;
F --> G4["1D Conv\n Width 5"] --> H;
H --> I["Pool"] --> J["Fully\n Connected\n Layers"] --> K["..."];
The stacked parallel cnn encoder is a combination of the Parallel CNN and the Stacked CNN encoders where each layer of
the stack is composed of parallel convolutional layers.
It works by first mapping the input token sequence b x s
(where b
is the batch size and s
is the length of the
sequence) into a sequence of embeddings, then it passes the embedding through a stack of several parallel 1d
convolutional layers with different filter size, followed by an optional final pool and by a flatten operation.
This single flattened vector is then passed through a stack of fully connected layers and returned as a b x h
tensor
where h
is the output size of the last fully connected layer.
If you want to output the full b x s x h
tensor, you can specify reduce_output: null
.
encoder:
type: stacked_parallel_cnn
dropout: 0.0
embedding_size: 256
output_size: 256
activation: relu
filter_size: 3
norm: null
representation: dense
num_stacked_layers: null
pool_function: max
pool_size: null
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: sum
norm_params: null
num_fc_layers: null
fc_layers: null
stacked_layers: null
num_filters: 256
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- output_size (default: 256): The default output_size that will be used for each layer.
- activation (default: relu): The default activation function that will be used for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- filter_size (default: 3): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- num_stacked_layers (default: null): If stacked_layers is null, this is the number of elements in the stack of parallel convolutional layers.
- pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers, this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- norm_params (default: null): Parameters used if norm is either batch or layer.
- num_fc_layers (default: null): Number of parallel fully connected layers to use.
- fc_layers (default: null): List of dictionaries containing the parameters for each fully connected layer.
- stacked_layers (default: null): A nested list of lists of dictionaries containing the parameters of the stack of parallel convolutional layers. The length of the list determines the number of stacked parallel convolutional layers, the length of the sub-lists determines the number of parallel conv layers, and the content of each dictionary determines the parameters for a specific layer.
- num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
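An illustrative sketch of the nested stacked_layers structure (the feature name is hypothetical): the outer list enumerates the stack, and each inner list enumerates the parallel convolutional layers of that stack element:
name: abstract              # hypothetical column name
type: text
encoder:
    type: stacked_parallel_cnn
    stacked_layers:
        - - filter_size: 2  # first stack element: three parallel conv layers
          - filter_size: 3
          - filter_size: 4
        - - filter_size: 3  # second stack element: two parallel conv layers
          - filter_size: 5
    num_filters: 64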
RNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["RNN Layers"];
C --> D["Fully\n Connected\n Layers"];
D --> ...;
The rnn encoder works by first mapping the input token sequence b x s
(where b
is the batch size and s
is the
length of the sequence) into a sequence of embeddings, then it passes the embedding through a stack of recurrent layers
(by default 1 layer), followed by a reduce operation that by default only returns the last output, but can perform other
reduce functions.
If you want to output the full b x s x h tensor, where h is the size of the output of the last rnn layer, you can specify reduce_output: null.
encoder:
type: rnn
dropout: 0.0
cell_type: rnn
num_layers: 1
state_size: 256
embedding_size: 256
output_size: 256
norm: null
num_fc_layers: 0
fc_dropout: 0.0
recurrent_dropout: 0.0
activation: tanh
fc_activation: relu
recurrent_activation: sigmoid
representation: dense
unit_forget_bias: true
recurrent_initializer: orthogonal
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: last
norm_params: null
fc_layers: null
bidirectional: false
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- cell_type (default: rnn): The type of recurrent cell to use. Available values are: rnn, lstm, gru. For reference about the differences between the cells please refer to torch.nn Recurrent Layers. Options: rnn, lstm, gru.
- num_layers (default: 1): The number of stacked recurrent layers.
- state_size (default: 256): The size of the state of the rnn.
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- output_size (default: 256): The default output_size that will be used for each layer.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, ghost, null.
- num_fc_layers (default: 0): Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- recurrent_dropout (default: 0.0): The dropout rate for the recurrent state.
- activation (default: tanh): The default activation function. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- recurrent_activation (default: sigmoid): The activation function to use in the recurrent step. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- unit_forget_bias (default: true): If true, add 1 to the bias of the forget gate at initialization. Options: true, false.
- recurrent_initializer (default: orthogonal): The initializer for recurrent matrix weights. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: last): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- norm_params (default: null): Default parameters passed to the norm module.
- fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- bidirectional (default: false): If true, two recurrent networks will perform encoding in the forward and backward direction and their outputs will be concatenated. Options: true, false.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
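For example, a sketch of a bidirectional LSTM configuration (the feature name is hypothetical):
name: sentence              # hypothetical column name
type: text
encoder:
    type: rnn
    cell_type: lstm
    num_layers: 2
    state_size: 128
    bidirectional: true     # forward and backward outputs are concatenated
    reduce_output: last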
CNN RNN Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C1["CNN Layers"];
C1 --> C2["RNN Layers"];
C2 --> D["Fully\n Connected\n Layers"];
D --> ...;
The cnnrnn
encoder works by first mapping the input token sequence b x s
(where b
is the batch size and s
is
the length of the sequence) into a sequence of embeddings, then it passes the embedding through a stack of convolutional
layers (by default 2), that is followed by a stack of recurrent layers (by default 1), followed by a reduce operation
that by default only returns the last output, but can perform other reduce functions.
If you want to output the full b x s x h tensor, where h is the size of the output of the last rnn layer, you can specify reduce_output: null.
encoder:
type: cnnrnn
dropout: 0.0
conv_dropout: 0.0
cell_type: rnn
num_conv_layers: null
state_size: 256
embedding_size: 256
output_size: 256
norm: null
num_fc_layers: 0
fc_dropout: 0.0
recurrent_dropout: 0.0
activation: tanh
filter_size: 5
strides: 1
fc_activation: relu
recurrent_activation: sigmoid
conv_activation: relu
representation: dense
conv_layers: null
pool_function: max
pool_size: null
dilation_rate: 1
pool_strides: null
pool_padding: same
unit_forget_bias: true
recurrent_initializer: orthogonal
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: last
norm_params: null
fc_layers: null
num_filters: 256
padding: same
num_rec_layers: 1
bidirectional: false
pretrained_embeddings: null
Parameters:
- dropout (default: 0.0): Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- conv_dropout (default: 0.0): The dropout rate for the convolutional layers.
- cell_type (default: rnn): The type of recurrent cell to use. Available values are: rnn, lstm, gru. For reference about the differences between the cells please refer to torch.nn Recurrent Layers. Options: rnn, lstm, gru.
- num_conv_layers (default: null): The number of stacked convolutional layers when conv_layers is null.
- state_size (default: 256): The size of the state of the rnn.
- embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
- output_size (default: 256): The default output_size that will be used for each layer.
- norm (default: null): The default norm that will be used for each layer. Options: batch, layer, ghost, null.
- num_fc_layers (default: 0): Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- recurrent_dropout (default: 0.0): The dropout rate for the recurrent state.
- activation (default: tanh): The default activation function to use. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- filter_size (default: 5): Size of the 1d convolutional filter. It indicates how wide the 1d convolutional filter is.
- strides (default: 1): Stride length of the convolution.
- fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- recurrent_activation (default: sigmoid): The activation function to use in the recurrent step. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- conv_activation (default: relu): The default activation function that will be used for each convolutional layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized to be one-hot encodings. Options: dense, sparse.
- conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both conv_layers and num_conv_layers are null, a default list will be assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
- pool_function (default: max): Pooling function to use. max will select the maximum value. Any of average, avg, or mean will compute the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- pool_size (default: null): The default pool_size that will be used for each layer. If a pool_size is not already specified in conv_layers, this is the default pool_size that will be used for each layer. It indicates the size of the max pooling that will be performed along the s sequence dimension after the convolution operation.
- dilation_rate (default: 1): Dilation rate to use for dilated convolution.
- pool_strides (default: null): Factor to scale down.
- pool_padding (default: same): Padding to use. Options: valid, same.
- unit_forget_bias (default: true): If true, add 1 to the bias of the forget gate at initialization. Options: true, false.
- recurrent_initializer (default: orthogonal): The initializer for recurrent matrix weights. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
- use_bias (default: true): Whether to use a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- embeddings_on_cpu (default: false): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve the lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: true, false.
- embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. It may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
- reduce_output (default: last): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- norm_params (default: null): Default parameters passed to the norm module.
- fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1d convolution.
- padding (default: same): Padding to use. Options: valid, same.
- num_rec_layers (default: 1): The number of stacked recurrent layers.
- bidirectional (default: false): If true, two recurrent networks will perform encoding in the forward and backward direction and their outputs will be concatenated. Options: true, false.
- pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
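For example, a sketch of a cnnrnn encoder combining a small convolutional stack with a GRU layer (the feature name is hypothetical):
name: transcript            # hypothetical column name
type: text
encoder:
    type: cnnrnn
    num_conv_layers: 2      # convolutional stack applied before the recurrent layers
    cell_type: gru
    num_rec_layers: 1
    state_size: 128
    reduce_output: last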
Transformer Encoder¶
graph LR
A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
B --> C["Transformer\n Blocks"];
C --> D["Fully\n Connected\n Layers"];
D --> ...;
The transformer
encoder implements a stack of transformer blocks, replicating the architecture introduced in the
Attention is all you need paper, and adds an optional stack of fully connected
layers at the end.
encoder:
type: transformer
dropout: 0.1
num_layers: 1
embedding_size: 256
output_size: 256
norm: null
num_fc_layers: 0
fc_dropout: 0.0
hidden_size: 256
transformer_output_size: 256
fc_activation: relu
representation: dense
use_bias: true
bias_initializer: zeros
weights_initializer: xavier_uniform
embeddings_on_cpu: false
embeddings_trainable: true
reduce_output: last
norm_params: null
fc_layers: null
num_heads: 8
pretrained_embeddings: null
Parameters:
dropout
(default:0.1
) : The dropout rate for the transformer block. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).num_layers
(default:1
) : The number of transformer layers.embedding_size
(default:256
) : The maximum embedding size. The actual size will bemin(vocabulary_size, embedding_size)
fordense
representations and exactlyvocabulary_size
for thesparse
encoding, wherevocabulary_size
is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>
,<PAD>
,<SOS>
,<EOS>
).output_size
(default:256
) : The default output_size that will be used for each layer.norm
(default:null
) : The default norm that will be used for each layer. Options:batch
,layer
,ghost
,null
.num_fc_layers
(default:0
) : Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.fc_dropout
(default:0.0
) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).hidden_size
(default:256
): The size of the hidden representation within the transformer block. It is usually the same as the embedding_size, but if the two values are different, a projection layer will be added before the first transformer block.transformer_output_size
(default:256
): Size of the fully connected layer after self attention in the transformer block. This is usually the same as hidden_size and embedding_size.fc_activation
(default:relu
): Default activation function applied to the output of the fully connected layers. Options:elu
,leakyRelu
,logSigmoid
,relu
,sigmoid
,tanh
,softmax
,null
.representation
(default:dense
): Representation of the embedding.dense
means the embeddings are initialized randomly,sparse
means they are initialized to be one-hot encodings. Options:dense
,sparse
.use_bias
(default:true
): Whether to use a bias vector. Options:true
,false
.-
bias_initializer
(default:zeros
): Initializer for the bias vector. Options:uniform
,normal
,constant
,ones
,zeros
,eye
,dirac
,xavier_uniform
,xavier_normal
,kaiming_uniform
,kaiming_normal
,orthogonal
,sparse
,identity
. Alternatively it is possible to specify a dictionary with a keytype
that identifies the type of initializer and other keys for its parameters, e.g.{type: normal, mean: 0, stddev: 0}
. For a description of the parameters of each initializer, see torch.nn.init. -
weights_initializer
(default:xavier_uniform
): Initializer for the weight matrix. Options:uniform
,normal
,constant
,ones
,zeros
,eye
,dirac
,xavier_uniform
,xavier_normal
,kaiming_uniform
,kaiming_normal
,orthogonal
,sparse
,identity
. Alternatively it is possible to specify a dictionary with a keytype
that identifies the type of initializer and other keys for its parameters, e.g.{type: normal, mean: 0, stddev: 0}
. For a description of the parameters of each initializer, see torch.nn.init. -
embeddings_on_cpu
(default:false
): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options:true
,false
. embeddings_trainable
(default:true
): Iftrue
embeddings are trained during the training process, iffalse
embeddings are fixed. This can be useful when loading pretrained embeddings, to avoid fine-tuning them. This parameter has effect only whenrepresentation
isdense
;sparse
one-hot encodings are not trainable. Options:true
,false
.reduce_output
(default:last
): How to reduce the output tensor along thes
sequence length dimension if the rank of the tensor is greater than 2. Options:last
,sum
,mean
,avg
,max
,concat
,attention
,none
,None
,null
.norm_params
(default:null
): Default parameters passed to thenorm
module.-
fc_layers
(default:null
): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are:activation
,dropout
,norm
,norm_params
,output_size
,use_bias
,bias_initializer
andweights_initializer
. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead (see the sketch after this parameter list). -
num_heads
(default:8
): Number of attention heads in each transformer block. -
pretrained_embeddings
(default:null
): Path to a file containing pretrained embeddings. By defaultdense
embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the embeddings file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only ifrepresentation
isdense
.
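As referenced in the fc_layers parameter above, here is a minimal sketch of a per-layer override for this encoder; the layer sizes and dropout value are illustrative assumptions, not recommended settings:
encoder:
    type: transformer
    fc_layers:
        - output_size: 512
          activation: relu
        - output_size: 256
          dropout: 0.2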
Huggingface encoders¶
All huggingface-based text encoders are configured with the following parameters:
pretrained_model_name_or_path
(default is the huggingface default model path for the specified encoder, i.e.bert-base-uncased
for BERT). This can be either the name of a model or a path where it was downloaded. For details on the variants available refer to the Hugging Face documentation.reduce_output
(defaultcls_pooled
): defines how to reduce the output tensor along thes
sequence length dimension if the rank of the tensor is greater than 2. Available values are:cls_pooled
,sum
,mean
oravg
,max
,concat
(concatenates along the first dimension),last
(returns the last vector of the first dimension) andnull
(which does not reduce and returns the full tensor).trainable
(defaultfalse
): iftrue
the weights of the encoder will be trained, otherwise they will be kept frozen.
Note
Any hyperparameter of any huggingface encoder can be overridden. Check the huggingface documentation for which parameters are used for which models.
name: text_column_name
type: text
encoder:
    type: bert
    trainable: true
    num_attention_heads: 16 # Instead of 12
AutoTransformer¶
The auto_transformer
encoder automatically instantiates the model architecture for the specified pretrained_model_name_or_path
. Unlike the other HF encoders, auto_transformer
does not provide a default value for pretrained_model_name_or_path
, this is its only mandatory parameter. See the Hugging Face AutoModels documentation for more details.
encoder:
type: auto_transformer
pretrained_model_name_or_path: bert-base-uncased
trainable: false
reduce_output: sum
pretrained_kwargs: null
adapter: null
Parameters:
pretrained_model_name_or_path
(default:null
) : Name or path of the pretrained model.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor. Options:last
,sum
,mean
,avg
,max
,concat
,attention
,none
,None
,null
.-
pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning
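Because pretrained_model_name_or_path has no default, it must always be set explicitly for auto_transformer. A minimal sketch, using bert-base-uncased purely as an illustrative choice of Hugging Face model:
encoder:
    type: auto_transformer
    pretrained_model_name_or_path: bert-base-uncased
    trainable: false
    reduce_output: sum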
ALBERT¶
The albert
encoder loads a pretrained ALBERT (default albert-base-v2
) model using the Hugging Face transformers package. ALBERT is similar to BERT, with significantly lower memory usage and somewhat faster training time.
encoder:
type: albert
use_pretrained: true
trainable: false
pretrained_model_name_or_path: albert-base-v2
reduce_output: cls_pooled
embedding_size: 128
hidden_size: 768
num_hidden_layers: 12
num_hidden_groups: 1
num_attention_heads: 12
intermediate_size: 3072
inner_group_num: 1
hidden_act: gelu_new
hidden_dropout_prob: 0.0
attention_probs_dropout_prob: 0.0
max_position_embeddings: 512
type_vocab_size: 2
initializer_range: 0.02
layer_norm_eps: 1.0e-12
classifier_dropout_prob: 0.1
position_embedding_type: absolute
pad_token_id: 0
bos_token_id: 2
eos_token_id: 3
pretrained_kwargs: null
adapter: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:albert-base-v2
): Name or path of the pretrained model.reduce_output
(default:cls_pooled
): The method used to reduce a sequence of tensors down to a single tensor.embedding_size
(default:128
): Dimensionality of vocabulary embeddings.hidden_size
(default:768
): Dimensionality of the encoder layers and the pooler layer.num_hidden_layers
(default:12
): Number of hidden layers in the Transformer encoder.num_hidden_groups
(default:1
): Number of groups for the hidden layers, parameters in the same group are shared.num_attention_heads
(default:12
): Number of attention heads for each attention layer in the Transformer encoder.intermediate_size
(default:3072
): The dimensionality of the βintermediateβ (often named feed-forward) layer in the Transformer encoder.inner_group_num
(default:1
): The number of inner repetition of attention and ffn.hidden_act
(default:gelu_new
): The non-linear activation function (function or string) in the encoder and pooler. Options:gelu
,relu
,silu
,gelu_new
.hidden_dropout_prob
(default:0.0
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.attention_probs_dropout_prob
(default:0.0
): The dropout ratio for the attention probabilities.max_position_embeddings
(default:512
): The maximum sequence length that this model might ever be used with. Typically set this to something large (e.g., 512 or 1024 or 2048).type_vocab_size
(default:2
): The vocabulary size of the token_type_ids passed when calling AlbertModel or TFAlbertModel.initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.layer_norm_eps
(default:1e-12
): The epsilon used by the layer normalization layers.classifier_dropout_prob
(default:0.1
): The dropout ratio for attached classifiers.position_embedding_type
(default:absolute
): Options:absolute
,relative_key
,relative_key_query
.pad_token_id
(default:0
): The ID of the token to use as padding.bos_token_id
(default:2
): The beginning of sequence token ID.eos_token_id
(default:3
): The end of sequence token ID.-
pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning
BERT¶
The bert encoder loads a pretrained BERT (default bert-base-uncased) model using the Hugging Face transformers package. BERT is a bidirectional transformer pretrained using a combination of masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.
encoder:
type: bert
use_pretrained: true
trainable: false
pretrained_model_name_or_path: bert-base-uncased
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
max_position_embeddings: 512
classifier_dropout: null
reduce_output: cls_pooled
hidden_size: 768
num_hidden_layers: 12
num_attention_heads: 12
intermediate_size: 3072
hidden_act: gelu
type_vocab_size: 2
initializer_range: 0.02
layer_norm_eps: 1.0e-12
pad_token_id: 0
gradient_checkpointing: false
position_embedding_type: absolute
pretrained_kwargs: null
adapter: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:bert-base-uncased
): Name or path of the pretrained model.hidden_dropout_prob
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.attention_probs_dropout_prob
(default:0.1
): The dropout ratio for the attention probabilities.max_position_embeddings
(default:512
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).classifier_dropout
(default:null
): The dropout ratio for the classification head.reduce_output
(default:cls_pooled
): The method used to reduce a sequence of tensors down to a single tensor.hidden_size
(default:768
): Dimensionality of the encoder layers and the pooler layer.num_hidden_layers
(default:12
): Number of hidden layers in the Transformer encoder.num_attention_heads
(default:12
): Number of attention heads for each attention layer in the Transformer encoder.intermediate_size
(default:3072
): Dimensionality of the βintermediateβ (often named feed-forward) layer in the Transformer encoder.hidden_act
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. Options:gelu
,relu
,silu
,gelu_new
.type_vocab_size
(default:2
): The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel.initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.layer_norm_eps
(default:1e-12
): The epsilon used by the layer normalization layers.pad_token_id
(default:0
): The ID of the token to use as padding.gradient_checkpointing
(default:false
): Whether to use gradient checkpointing. Options:true
,false
.position_embedding_type
(default:absolute
): Type of position embedding. Options:absolute
,relative_key
,relative_key_query
.-
pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning
CamemBERT¶
The camembert
encoder loads a pretrained CamemBERT (default camembert-base
) model using the Hugging Face transformers package. CamemBERT is pre-trained on a large French language web-crawled text corpus.
encoder:
type: camembert
use_pretrained: true
trainable: false
pretrained_model_name_or_path: camembert-base
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
max_position_embeddings: 514
classifier_dropout: null
reduce_output: sum
hidden_size: 768
hidden_act: gelu
initializer_range: 0.02
adapter: null
num_hidden_layers: 12
num_attention_heads: 12
intermediate_size: 3072
type_vocab_size: 1
layer_norm_eps: 1.0e-05
pad_token_id: 1
gradient_checkpointing: false
position_embedding_type: absolute
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:camembert-base
): Name or path of the pretrained model.hidden_dropout_prob
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.attention_probs_dropout_prob
(default:0.1
): The dropout ratio for the attention probabilities.max_position_embeddings
(default:514
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).classifier_dropout
(default:null
): The dropout ratio for the classification head.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.hidden_size
(default:768
): Dimensionality of the encoder layers and the pooler layer.hidden_act
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. Options:gelu
,relu
,silu
,gelu_new
.-
initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
num_hidden_layers
(default:12
): Number of hidden layers in the Transformer encoder. num_attention_heads
(default:12
): Number of attention heads for each attention layer in the Transformer encoder.intermediate_size
(default:3072
): Dimensionality of the βintermediateβ (often named feed-forward) layer in the Transformer encoder.type_vocab_size
(default:1
): The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel.layer_norm_eps
(default:1e-05
): The epsilon used by the layer normalization layers.pad_token_id
(default:1
): The ID of the token to use as padding.gradient_checkpointing
(default:false
): Whether to use gradient checkpointing. Options:true
,false
.position_embedding_type
(default:absolute
): Type of position embedding. Options:absolute
,relative_key
,relative_key_query
.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
DeBERTa¶
The DeBERTa encoder improves on the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With those two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB of training data. In DeBERTa V3, the authors further improved the efficiency of DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. Compared to DeBERTa, the V3 version significantly improves model performance on downstream tasks.
encoder:
type: deberta
use_pretrained: true
trainable: false
pretrained_model_name_or_path: sileod/deberta-v3-base-tasksource-nli
hidden_size: 1536
num_hidden_layers: 24
num_attention_heads: 24
intermediate_size: 6144
hidden_act: gelu
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
max_position_embeddings: 512
type_vocab_size: 0
initializer_range: 0.02
layer_norm_eps: 1.0e-12
relative_attention: true
max_relative_positions: -1
pad_token_id: 0
position_biased_input: false
pos_att_type:
- p2c
- c2p
pooler_hidden_size: 1536
pooler_dropout: 0
pooler_hidden_act: gelu
position_buckets: 256
share_att_key: true
norm_rel_ebd: layer_norm
adapter: null
pretrained_kwargs: null
reduce_output: sum
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.-
pretrained_model_name_or_path
(default:sileod/deberta-v3-base-tasksource-nli
): Name or path of the pretrained model. -
hidden_size
(default:1536
): Dimensionality of the encoder layers and the pooler layer. num_hidden_layers
(default:24
): Number of hidden layers in the Transformer encoder.num_attention_heads
(default:24
): Number of attention heads for each attention layer in the Transformer encoder.intermediate_size
(default:6144
): Dimensionality of the 'intermediate' (often named feed-forward) layer in the Transformer encoder.hidden_act
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. Options:gelu
,relu
,silu
,tanh
,gelu_fast
,mish
,linear
,sigmoid
,gelu_new
.hidden_dropout_prob
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.attention_probs_dropout_prob
(default:0.1
): The dropout ratio for the attention probabilities.max_position_embeddings
(default:512
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).type_vocab_size
(default:0
): The vocabulary size of thetoken_type_ids
.initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.layer_norm_eps
(default:1e-12
): The epsilon used by the layer normalization layers.relative_attention
(default:true
): Whether use relative position encoding. Options:true
,false
.max_relative_positions
(default:-1
): The range of relative positions[-max_position_embeddings, max_position_embeddings]
. Use the same value asmax_position_embeddings
.pad_token_id
(default:0
): The value used to pad input_ids.position_biased_input
(default:false
): Whether add absolute position embedding to content embedding. Options:true
,false
.pos_att_type
(default:["p2c", "c2p"]
): The type of relative position attention; it can be a combination of['p2c', 'c2p']
, e.g.['p2c']
or['p2c', 'c2p']
.pooler_hidden_size
(default:1536
): The hidden size of the pooler layers.pooler_dropout
(default:0
): The dropout ratio for the pooler layers.pooler_hidden_act
(default:gelu
): The activation function (function or string) in the pooler. Options:gelu
,relu
,silu
,tanh
,gelu_fast
,mish
,linear
,sigmoid
,gelu_new
.position_buckets
(default:256
): The number of buckets to use for each attention layer.share_att_key
(default:true
): Whether to share attention key across layers. Options:true
,false
.-
norm_rel_ebd
(default:layer_norm
): The normalization method for relative embeddings. Options:layer_norm
,none
. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model. -
reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor. Options:cls_pooled
,last
,sum
,mean
,max
,concat
,attention
,null
.
DistilBERT¶
The distilbert
encoder loads a pretrained DistilBERT (default distilbert-base-uncased
) model using the Hugging Face transformers package. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
encoder:
type: distilbert
use_pretrained: true
trainable: false
pretrained_model_name_or_path: distilbert-base-uncased
dropout: 0.1
max_position_embeddings: 512
attention_dropout: 0.1
activation: gelu
reduce_output: sum
initializer_range: 0.02
qa_dropout: 0.1
seq_classif_dropout: 0.2
adapter: null
sinusoidal_pos_embds: false
n_layers: 6
n_heads: 12
dim: 768
hidden_dim: 3072
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:distilbert-base-uncased
): Name or path of the pretrained model.dropout
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.max_position_embeddings
(default:512
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).attention_dropout
(default:0.1
): The dropout ratio for the attention probabilities.activation
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. If string, 'gelu', 'relu', 'silu' and 'gelu_new' are supported. Options:gelu
,relu
,silu
,gelu_new
.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.qa_dropout
(default:0.1
): The dropout probabilities used in the question answering model DistilBertForQuestionAnswering.-
seq_classif_dropout
(default:0.2
): The dropout probabilities used in the sequence classification and the multiple choice model DistilBertForSequenceClassification. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
sinusoidal_pos_embds
(default:false
): Whether to use sinusoidal positional embeddings. Options:true
,false
. n_layers
(default:6
): Number of hidden layers in the Transformer encoder.n_heads
(default:12
): Number of hidden layers in the Transformer encoder.dim
(default:768
): Dimensionality of the encoder layers and the pooler layer.hidden_dim
(default:3072
): The size of the βintermediateβ (often named feed-forward) layer in the Transformer encoder.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
ELECTRA¶
The electra encoder loads a pretrained ELECTRA model using the Hugging Face transformers package. ELECTRA is a pretraining approach which trains two transformer models: the generator and the discriminator. The generator's role is to replace tokens in a sequence, and it is therefore trained as a masked language model. The discriminator, which is the model we're interested in, tries to identify which tokens were replaced by the generator in the sequence.
encoder:
type: electra
use_pretrained: true
trainable: false
pretrained_model_name_or_path: google/electra-small-discriminator
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
max_position_embeddings: 512
classifier_dropout: null
reduce_output: sum
embedding_size: 128
hidden_size: 256
hidden_act: gelu
initializer_range: 0.02
adapter: null
num_hidden_layers: 12
num_attention_heads: 4
intermediate_size: 1024
type_vocab_size: 2
layer_norm_eps: 1.0e-12
position_embedding_type: absolute
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:google/electra-small-discriminator
): Name or path of the pretrained model.hidden_dropout_prob
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.attention_probs_dropout_prob
(default:0.1
): The dropout ratio for the attention probabilities.max_position_embeddings
(default:512
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).classifier_dropout
(default:null
): The dropout ratio for the classification head.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.embedding_size
(default:128
): Dimensionality of the encoder layers and the pooler layer.hidden_size
(default:256
): Dimensionality of the encoder layers and the pooler layer.hidden_act
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. Options:gelu
,relu
,silu
,gelu_new
.-
initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
num_hidden_layers
(default:12
): Number of hidden layers in the Transformer encoder. num_attention_heads
(default:4
): Number of attention heads for each attention layer in the Transformer encoder.intermediate_size
(default:1024
): Dimensionality of the βintermediateβ (i.e., feed-forward) layer in the Transformer encoder.type_vocab_size
(default:2
): The vocabulary size of the token_type_ids passed when calling ElectraModel or TFElectraModel.layer_norm_eps
(default:1e-12
): The epsilon used by the layer normalization layers.position_embedding_type
(default:absolute
): Type of position embedding. Options:absolute
,relative_key
,relative_key_query
.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
FlauBERT¶
The flaubert encoder loads a pretrained FlauBERT (default flaubert/flaubert_small_cased) model using the Hugging Face transformers package. FlauBERT (https://arxiv.org/abs/1912.05372) has an architecture similar to BERT and is pre-trained on a large French language corpus.
encoder:
type: flaubert
use_pretrained: true
trainable: false
pretrained_model_name_or_path: flaubert/flaubert_small_cased
dropout: 0.1
reduce_output: sum
pre_norm: true
layerdrop: 0.2
emb_dim: 512
n_layers: 6
n_heads: 8
attention_dropout: 0.1
gelu_activation: true
sinusoidal_embeddings: false
causal: false
asm: false
n_langs: 1
use_lang_emb: true
max_position_embeddings: 512
embed_init_std: 0.02209708691207961
init_std: 0.02
layer_norm_eps: 1.0e-06
bos_index: 0
eos_index: 1
pad_index: 2
unk_index: 3
mask_index: 5
is_encoder: true
mask_token_id: 0
lang_id: 0
pretrained_kwargs: null
adapter: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:flaubert/flaubert_small_cased
): Name or path of the pretrained model.dropout
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.pre_norm
(default:true
): Whether to apply the layer normalization before or after the feed forward layer following the attention in each layer (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018) Options:true
,false
.layerdrop
(default:0.2
): Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand with Structured Dropout. ICLR 2020)emb_dim
(default:512
): Dimensionality of the encoder layers and the pooler layer.n_layers
(default:6
): Number of hidden layers in the Transformer encoder.n_heads
(default:8
): Number of attention heads for each attention layer in the Transformer encoder.attention_dropout
(default:0.1
): The dropout probability for the attention mechanismgelu_activation
(default:true
): Whether or not to use a gelu activation instead of relu. Options:true
,false
.sinusoidal_embeddings
(default:false
): Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings. Options:true
,false
.causal
(default:false
): Whether or not the model should behave in a causal manner. Causal models use a triangular attention mask in order to only attend to the left-side context instead of a bidirectional context. Options:true
,false
.asm
(default:false
): Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction layer. Options:true
,false
.n_langs
(default:1
): The number of languages the model handles. Set to 1 for monolingual models.use_lang_emb
(default:true
): Whether to use language embeddings. Some models use additional language embeddings, see the multilingual models page for information on how to use them. Options:true
,false
.max_position_embeddings
(default:512
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).embed_init_std
(default:0.02209708691207961
): The standard deviation of the truncated_normal_initializer for initializing the embedding matrices.init_std
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices except the embedding matrices.layer_norm_eps
(default:1e-06
): The epsilon used by the layer normalization layers.bos_index
(default:0
): The index of the beginning of sentence token in the vocabulary.eos_index
(default:1
): The index of the end of sentence token in the vocabulary.pad_index
(default:2
): The index of the padding token in the vocabulary.unk_index
(default:3
): The index of the unknown token in the vocabulary.mask_index
(default:5
): The index of the masking token in the vocabulary.is_encoder
(default:true
): Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al. Options:true
,false
.mask_token_id
(default:0
): Model agnostic parameter to identify masked tokens when generating text in an MLM context.lang_id
(default:0
): The ID of the language used by the model. This parameter is used when generating text in a given language.-
pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning
GPT¶
The gpt
encoder loads a pretrained GPT (default openai-gpt
) model using the Hugging Face transformers package. GPT is a causal (unidirectional) transformer pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.
encoder:
type: gpt
use_pretrained: true
trainable: false
pretrained_model_name_or_path: openai-gpt
reduce_output: sum
initializer_range: 0.02
adapter: null
n_positions: 40478
n_ctx: 512
n_embd: 768
n_layer: 12
n_head: 12
afn: gelu
resid_pdrop: 0.1
embd_pdrop: 0.1
attn_pdrop: 0.1
layer_norm_epsilon: 1.0e-05
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:openai-gpt
): Name or path of the pretrained model.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.-
initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
n_positions
(default:40478
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). n_ctx
(default:512
): Dimensionality of the causal mask (usually same as n_positions)n_embd
(default:768
): Dimensionality of the embeddings and hidden states.n_layer
(default:12
): Number of hidden layers in the Transformer encoder.n_head
(default:12
): Number of attention heads for each attention layer in the Transformer encoder.afn
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. Options:gelu
,relu
,silu
.resid_pdrop
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.embd_pdrop
(default:0.1
): The dropout ratio for the embeddings.attn_pdrop
(default:0.1
): The dropout ratio for the attention.layer_norm_epsilon
(default:1e-05
): The epsilon to use in the layer normalization layerspretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
GPT2¶
The gpt2
encoder loads a pretrained GPT-2 (default gpt2
) model using the Hugging Face transformers package. GPT-2 is a causal (unidirectional) transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.
encoder:
type: gpt2
use_pretrained: true
trainable: false
pretrained_model_name_or_path: gpt2
reduce_output: sum
initializer_range: 0.02
adapter: null
n_positions: 1024
n_ctx: 1024
n_embd: 768
n_layer: 12
n_head: 12
n_inner: null
activation_function: gelu_new
resid_pdrop: 0.1
embd_pdrop: 0.1
attn_pdrop: 0.1
layer_norm_epsilon: 1.0e-05
scale_attn_weights: true
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:gpt2
): Name or path of the pretrained model.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.-
initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
n_positions
(default:1024
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). n_ctx
(default:1024
): Dimensionality of the causal mask (usually same as n_positions)n_embd
(default:768
): Dimensionality of the embeddings and hidden states.n_layer
(default:12
): Number of hidden layers in the Transformer encoder.n_head
(default:12
): Number of attention heads for each attention layer in the Transformer encoder.n_inner
(default:null
): Dimensionality of the inner feed-forward layers. None will set it to 4 times n_embdactivation_function
(default:gelu_new
): Activation function, to be selected in the list ['relu', 'silu', 'gelu', 'tanh', 'gelu_new']. Options:relu
,silu
,gelu
,tanh
,gelu_new
.resid_pdrop
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.embd_pdrop
(default:0.1
): The dropout ratio for the embeddings.attn_pdrop
(default:0.1
): The dropout ratio for the attention.layer_norm_epsilon
(default:1e-05
): The epsilon to use in the layer normalization layers.scale_attn_weights
(default:true
): Scale attention weights by dividing by sqrt(hidden_size). Options:true
,false
.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
Longformer¶
The longformer
encoder loads a pretrained Longformer (default allenai/longformer-base-4096
) model using the Hugging Face transformers package. Longformer is a good choice for longer text, as it supports sequences up to 4096 tokens long.
encoder:
type: longformer
use_pretrained: true
trainable: false
pretrained_model_name_or_path: allenai/longformer-base-4096
max_position_embeddings: 4098
reduce_output: cls_pooled
attention_window: 512
sep_token_id: 2
adapter: null
type_vocab_size: 1
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:allenai/longformer-base-4096
): Name or path of the pretrained model.max_position_embeddings
(default:4098
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).-
reduce_output
(default:cls_pooled
): The method used to reduce a sequence of tensors down to a single tensor. -
attention_window
(default:512
): Size of an attention window around each token. If an int, use the same size for all layers. To specify a different window size for each layer, use a List[int] where len(attention_window) == num_hidden_layers (see the sketch after this parameter list). -
sep_token_id
(default:2
): ID of the separator token, which is used when building a sequence from multiple sequences -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
type_vocab_size
(default:1
): The vocabulary size of the token_type_ids passed when calling LongformerEncoder pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
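As referenced in the attention_window parameter above, here is a minimal sketch of a per-layer window configuration; the window sizes are illustrative assumptions, and the list length matches the 12 hidden layers of allenai/longformer-base-4096:
encoder:
    type: longformer
    pretrained_model_name_or_path: allenai/longformer-base-4096
    attention_window: [64, 64, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512]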
RoBERTa¶
The roberta
encoder loads a pretrained RoBERTa (default roberta-base
) model using the Hugging Face transformers package. RoBERTa is a replication of BERT pretraining that may match or exceed the performance of BERT. It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.
encoder:
type: roberta
use_pretrained: true
trainable: false
pretrained_model_name_or_path: roberta-base
reduce_output: cls_pooled
eos_token_id: 2
adapter: null
pad_token_id: 1
bos_token_id: 0
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:roberta-base
): Name or path of the pretrained model.reduce_output
(default:cls_pooled
): The method used to reduce a sequence of tensors down to a single tensor.-
eos_token_id
(default:2
): The end of sequence token ID. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
pad_token_id
(default:1
): The ID of the token to use as padding. bos_token_id
(default:0
): The beginning of sequence token ID.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
T5¶
The t5
encoder loads a pretrained T5 (default t5-small
) model using the Hugging Face transformers package. T5 (Text-to-Text Transfer Transformer) is pre-trained on a huge text dataset crawled from the web and shows good transfer performance on multiple tasks.
encoder:
type: t5
use_pretrained: true
trainable: false
pretrained_model_name_or_path: t5-small
num_layers: 6
dropout_rate: 0.1
reduce_output: sum
d_ff: 2048
adapter: null
d_model: 512
d_kv: 64
num_decoder_layers: 6
num_heads: 8
relative_attention_num_buckets: 32
layer_norm_eps: 1.0e-06
initializer_factor: 1
feed_forward_proj: relu
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:t5-small
): Name or path of the pretrained model.num_layers
(default:6
): Number of hidden layers in the Transformer encoder.dropout_rate
(default:0.1
): The ratio for all dropout layers.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.-
d_ff
(default:2048
): Size of the intermediate feed forward layer in each T5Block. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
d_model
(default:512
): Size of the encoder layers and the pooler layer. d_kv
(default:64
): Size of the key, query, value projections per attention head. d_kv has to be equal to d_model // num_heads.num_decoder_layers
(default:6
): Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not set.num_heads
(default:8
): Number of attention heads for each attention layer in the Transformer encoder.relative_attention_num_buckets
(default:32
): The number of buckets to use for each attention layer.layer_norm_eps
(default:1e-06
): The epsilon used by the layer normalization layers.initializer_factor
(default:1
): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).feed_forward_proj
(default:relu
): Type of feed forward layer to be used. Should be one of 'relu' or 'gated-gelu'. T5 v1.1 uses the 'gated-gelu' feed forward projection, while the original T5 uses 'relu' (see the sketch after this parameter list). Options:relu
,gated-gelu
.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
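As referenced in the feed_forward_proj parameter above, here is a minimal sketch of pairing a T5 v1.1-style checkpoint with the gated-gelu projection; the model name google/t5-v1_1-small is an illustrative assumption:
encoder:
    type: t5
    use_pretrained: true
    pretrained_model_name_or_path: google/t5-v1_1-small
    feed_forward_proj: gated-gelu
    reduce_output: sum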
TransformerXL¶
The transformer_xl
encoder loads a pretrained Transformer-XL (default transfo-xl-wt103
) model using the Hugging Face transformers package. It adds a novel positional encoding scheme which improves understanding and generation of long-form text up to thousands of tokens. Transformer-XL is a causal (unidirectional) transformer with relative positioning (sinusoidal) embeddings which can reuse previously computed hidden states to attend to longer context (memory). This model also uses adaptive softmax inputs and outputs (tied).
encoder:
type: transformer_xl
use_pretrained: true
trainable: false
pretrained_model_name_or_path: transfo-xl-wt103
dropout: 0.1
reduce_output: sum
adaptive: true
adapter: null
cutoffs:
- 20000
- 40000
- 200000
d_model: 1024
d_embed: 1024
n_head: 16
d_head: 64
d_inner: 4096
div_val: 4
pre_lnorm: false
n_layer: 18
mem_len: 1600
clamp_len: 1000
same_length: true
proj_share_all_but_first: true
attn_type: 0
sample_softmax: -1
dropatt: 0.0
untie_r: true
init: normal
init_range: 0.01
proj_init_std: 0.01
init_std: 0.02
layer_norm_epsilon: 1.0e-05
eos_token_id: 0
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:transfo-xl-wt103
): Name or path of the pretrained model.dropout
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.-
adaptive
(default:true
): Whether or not to use adaptive softmax. Options:true
,false
. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
cutoffs
(default:[20000, 40000, 200000]
): Cutoffs for the adaptive softmax. d_model
(default:1024
): Dimensionality of the modelβs hidden states.d_embed
(default:1024
): Dimensionality of the embeddingsn_head
(default:16
): Number of attention heads for each attention layer in the Transformer encoder.d_head
(default:64
): Dimensionality of the modelβs heads.d_inner
(default:4096
): Inner dimension in FFdiv_val
(default:4
): Divisor value for the adaptive input and softmax.pre_lnorm
(default:false
): Whether or not to apply LayerNorm to the input instead of the output in the blocks. Options:true
,false
.n_layer
(default:18
): Number of hidden layers in the Transformer encoder.mem_len
(default:1600
): Length of the retained previous heads.clamp_len
(default:1000
): Use the same pos embeddings after clamp_len.same_length
(default:true
): Whether or not to use the same attn length for all tokens Options:true
,false
.proj_share_all_but_first
(default:true
): True to share all but first projs, False not to share. Options:true
,false
.attn_type
(default:0
): Attention type. 0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.sample_softmax
(default:-1
): Number of samples in the sampled softmax.dropatt
(default:0.0
): The dropout ratio for the attention probabilities.untie_r
(default:true
): Whether or not to untie relative position biases. Options:true
,false
.init
(default:normal
): Parameter initializer to use.init_range
(default:0.01
): Parameters initialized by U(-init_range, init_range).proj_init_std
(default:0.01
): Parameters initialized by N(0, init_std)init_std
(default:0.02
): Parameters initialized by N(0, init_std)layer_norm_epsilon
(default:1e-05
): The epsilon to use in the layer normalization layerseos_token_id
(default:0
): The end of sequence token ID.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
XLMRoBERTa¶
The xlmroberta
encoder loads a pretrained XLM-RoBERTa (default xlm-roberta-base
) model using the Hugging Face transformers package. XLM-RoBERTa is a multi-language model similar to BERT, trained on 100 languages. XLM-RoBERTa is based on Facebookβs RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
encoder:
type: xlmroberta
use_pretrained: true
trainable: false
pretrained_model_name_or_path: xlm-roberta-base
reduce_output: cls_pooled
max_position_embeddings: 514
type_vocab_size: 1
adapter: null
pad_token_id: 1
bos_token_id: 0
eos_token_id: 2
add_pooling_layer: true
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:xlm-roberta-base
): Name or path of the pretrained model.reduce_output
(default:cls_pooled
): The method used to reduce a sequence of tensors down to a single tensor.max_position_embeddings
(default:514
): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).-
type_vocab_size
(default:1
): The vocabulary size of the token_type_ids passed in. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
pad_token_id
(default:1
): The ID of the token to use as padding. bos_token_id
(default:0
): The beginning of sequence token ID.eos_token_id
(default:2
): The end of sequence token ID.add_pooling_layer
(default:true
): Whether to add a pooling layer to the encoder. Options:true
,false
.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
XLNet¶
The xlnet
encoder loads a pretrained XLNet (default xlnet-base-cased
) model using the Hugging Face transformers package. XLNet is an extension of the Transformer-XL model pre-trained using an autoregressive method to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization order. XLNet outperforms BERT on a variety of benchmarks.
encoder:
type: xlnet
use_pretrained: true
trainable: false
pretrained_model_name_or_path: xlnet-base-cased
dropout: 0.1
reduce_output: sum
ff_activation: gelu
initializer_range: 0.02
summary_activation: tanh
summary_last_dropout: 0.1
adapter: null
d_model: 768
n_layer: 12
n_head: 12
d_inner: 3072
untie_r: true
attn_type: bi
layer_norm_eps: 1.0e-12
mem_len: null
reuse_len: null
use_mems_eval: true
use_mems_train: false
bi_data: false
clamp_len: -1
same_length: false
summary_type: last
summary_use_proj: true
start_n_top: 5
end_n_top: 5
pad_token_id: 5
bos_token_id: 1
eos_token_id: 2
pretrained_kwargs: null
Parameters:
use_pretrained
(default:true
) : Whether to use the pretrained weights for the model. If false, the model will train from scratch which is very computationally expensive. Options:true
,false
.trainable
(default:false
) : Whether to finetune the model on your dataset. Options:true
,false
.pretrained_model_name_or_path
(default:xlnet-base-cased
): Name or path of the pretrained model.dropout
(default:0.1
): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.reduce_output
(default:sum
): The method used to reduce a sequence of tensors down to a single tensor.ff_activation
(default:gelu
): The non-linear activation function (function or string) in the encoder and pooler. If string, 'gelu', 'relu', 'silu' and 'gelu_new' are supported. Options:gelu
,relu
,silu
,gelu_new
.initializer_range
(default:0.02
): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.summary_activation
(default:tanh
): Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.-
summary_last_dropout
(default:0.1
): Used in the sequence classification and multiple choice models. -
adapter
(default:null
): Whether to use parameter-efficient fine-tuning -
d_model
(default:768
): Dimensionality of the encoder layers and the pooler layer. n_layer
(default:12
): Number of hidden layers in the Transformer encoder.n_head
(default:12
): Number of attention heads for each attention layer in the Transformer encoder.d_inner
(default:3072
): Dimensionality of the βintermediateβ (often named feed-forward) layer in the Transformer encoder.untie_r
(default:true
): Whether or not to untie relative position biases Options:true
,false
.attn_type
(default:bi
): The attention type used by the model. Currently only 'bi' is supported. Options:bi
.layer_norm_eps
(default:1e-12
): The epsilon used by the layer normalization layers.mem_len
(default:null
): The number of tokens to cache. The key/value pairs that have already been pre-computed in a previous forward pass wonβt be re-computed.reuse_len
(default:null
): The number of tokens in the current batch to be cached and reused in the future.use_mems_eval
(default:true
): Whether or not the model should make use of the recurrent memory mechanism in evaluation mode. Options:true
,false
.use_mems_train
(default:false
): Whether or not the model should make use of the recurrent memory mechanism in train mode. Options:true
,false
.bi_data
(default:false
): Whether or not to use bidirectional input pipeline. Usually set to True during pretraining and False during finetuning. Options:true
,false
.clamp_len
(default:-1
): Clamp all relative distances larger than clamp_len. Setting this attribute to -1 means no clamping.same_length
(default:false
): Whether or not to use the same attention length for each token. Options:true
,false
.summary_type
(default:last
): Argument used when doing sequence summary. Used in the sequence classification and multiple choice models. Options:last
,first
,mean
,cls_index
,attn
.summary_use_proj
(default:true
): Options:true
,false
.start_n_top
(default:5
): Used in the SQuAD evaluation script.end_n_top
(default:5
): Used in the SQuAD evaluation script.pad_token_id
(default:5
): The ID of the token to use as padding.bos_token_id
(default:1
): The beginning of sequence token ID.eos_token_id
(default:2
): The end of sequence token ID.pretrained_kwargs
(default:null
): Additional kwargs to pass to the pretrained model.
LLM Encoders¶
Diagram: token IDs → Pretrained LLM → Last Hidden State → (downstream combiner).
The LLM encoder processes text with a pretrained LLM (e.g. llama-2-7b
) and passes the last hidden state of the LLM forward to the combiner. Like the LLM model type, adapter-based fine-tuning and quantization can be configured, and any combiner or decoder parameters will be bundled with the adapter weights.
Example config:
encoder:
type: llm
base_model: meta-llama/Llama-2-7b-hf
adapter:
type: lora
quantization:
bits: 4
Parameters:
Base Model¶
The base_model
parameter specifies the pretrained large language model to serve
as the foundation of your custom LLM.
More information about the base_model
parameter can be found here
Adapter¶
LoRA¶
LoRA is a simple, yet effective, method for parameter-efficient fine-tuning of pretrained language models. It works by adding a small number of trainable parameters to the model, which are used to adapt the pretrained parameters to the downstream task. This allows the model to be fine-tuned with a much smaller number of training examples, and can even be used to fine-tune models on tasks that have no training data available at all.
adapter:
type: lora
r: 8
dropout: 0.05
target_modules: null
use_rslora: false
use_dora: false
alpha: 16
pretrained_adapter_weights: null
postprocessor:
merge_adapter_into_base_model: false
progressbar: false
bias_type: none
r
(default:8
) : Lora attention dimension.dropout
(default:0.05
): The dropout probability for Lora layers.target_modules
(default:null
): List of module names or regex expression of the module names to replace with LoRA. For example, ['q', 'v'] or '.decoder.(SelfAttention|EncDecAttention).*(q|v)$'. Defaults to targeting the query and value matrices of all self-attention and encoder-decoder attention layers.use_rslora
(default:false
): When set to True, uses Rank-Stabilized LoRA which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it was proven to work better. Otherwise, it will use the original default value of lora_alpha/r. Paper: https://arxiv.org/abs/2312.03732. Options:true
,false
.use_dora
(default:false
): Enable 'Weight-Decomposed Low-Rank Adaptation' (DoRA). This technique decomposes the updates of the weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. Right now, DoRA only supports non-quantized linear layers. DoRA introduces a bigger overhead than pure LoRA, so it is recommended to merge weights for inference. For more information, see https://arxiv.org/abs/2402.09353 Options:true
,false
.alpha
(default:null
): The alpha parameter for Lora scaling. Defaults to2 * r
.pretrained_adapter_weights
(default:null
): Path to pretrained weights.postprocessor
:postprocessor.merge_adapter_into_base_model
(default:false
): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately). Options:true
,false
.postprocessor.progressbar
(default:false
): Instructs whether or not to show a progress bar indicating the unload and merge process. Options:true
,false
.bias_type
(default:none
): Bias type for Lora. Options:none
,all
,lora_only
.
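As a sketch, the LoRA adapter can be combined with the LLM encoder and restricted to specific modules; the module names q_proj and v_proj are an illustrative assumption for a LLaMA-style model, not universal defaults:
encoder:
    type: llm
    base_model: meta-llama/Llama-2-7b-hf
    adapter:
        type: lora
        r: 16
        alpha: 32
        dropout: 0.05
        target_modules:
            - q_proj
            - v_proj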
AdaLoRA¶
AdaLoRA is an extension of LoRA that allows the model to adapt the pretrained parameters to the downstream task in a task-specific manner. This is done by adding a small number of trainable parameters to the model, which are used to adapt the pretrained parameters to the downstream task. This allows the model to be fine-tuned with a much smaller number of training examples, and can even be used to fine-tune models on tasks that have no training data available at all.
adapter:
type: adalora
r: 8
dropout: 0.05
target_modules: null
use_rslora: false
use_dora: false
alpha: 16
pretrained_adapter_weights: null
postprocessor:
merge_adapter_into_base_model: false
progressbar: false
bias_type: none
target_r: 8
init_r: 12
tinit: 0
tfinal: 0
delta_t: 1
beta1: 0.85
beta2: 0.85
orth_reg_weight: 0.5
total_step: null
rank_pattern: null
r
(default:8
) : Lora attention dimension.dropout
(default:0.05
): The dropout probability for Lora layers.target_modules
(default:null
): List of module names or regex expression of the module names to replace with LoRA. For example, ['q', 'v'] or '.decoder.(SelfAttention|EncDecAttention).*(q|v)$'. Defaults to targeting the query and value matrices of all self-attention and encoder-decoder attention layers.use_rslora
(default:false
): When set to True, uses Rank-Stabilized LoRA which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it was proven to work better. Otherwise, it will use the original default value of lora_alpha/r. Paper: https://arxiv.org/abs/2312.03732. Options:true
,false
.use_dora
(default:false
): Enable 'Weight-Decomposed Low-Rank Adaptation' (DoRA). This technique decomposes the updates of the weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. Right now, DoRA only supports non-quantized linear layers. DoRA introduces a bigger overhead than pure LoRA, so it is recommended to merge weights for inference. For more information, see https://arxiv.org/abs/2402.09353 Options:true
,false
.alpha
(default:null
): The alpha parameter for Lora scaling. Defaults to2 * r
.pretrained_adapter_weights
(default:null
): Path to pretrained weights.postprocessor
:postprocessor.merge_adapter_into_base_model
(default:false
): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately). Options:true
,false
.postprocessor.progressbar
(default:false
): Instructs whether or not to show a progress bar indicating the unload and merge process. Options:true
,false
.bias_type
(default:none
): Bias type for Lora. Options:none
,all
,lora_only
.target_r
(default:8
): Target Lora Matrix Dimension. The target average rank of the incremental matrices.init_r
(default:12
): Initial Lora Matrix Dimension. The initial rank for each incremental matrix.tinit
(default:0
): The steps of initial fine-tuning warmup.tfinal
(default:0
): The steps of final fine-tuning warmup.delta_t
(default:1
): The time interval between two budget allocations, i.e. the step interval of rank allocation.beta1
(default:0.85
): The hyperparameter of EMA for sensitivity smoothing.beta2
(default:0.85
): The hyperparameter of EMA for uncertainty quantification.orth_reg_weight
(default:0.5
): The coefficient of orthogonality regularization.total_step
(default:null
): The total training steps that should be specified before training.rank_pattern
(default:null
): The allocated rank for each weight matrix by RankAllocator.
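As a sketch, an AdaLoRA configuration that starts each incremental matrix at rank 12 and prunes down to an average target rank of 4 over training might look like the following; the step counts are illustrative and total_step should match the actual number of training steps.
adapter:
  type: adalora
  init_r: 12        # initial rank of each incremental matrix
  target_r: 4       # target average rank after budget allocation
  tinit: 200        # steps of initial fine-tuning warmup before pruning begins
  tfinal: 500       # steps of final fine-tuning warmup
  delta_t: 10       # reallocate the rank budget every 10 steps
  total_step: 5000  # total number of training steps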
IA3¶
Infused Adapter by Inhibiting and Amplifying Inner Activations, or IA3,
is a method that adds three learned vectors l_k, l_v, and l_ff, to rescale the keys and values of the self-attention and encoder-decoder attention layers, and the intermediate activation of the position-wise feed-forward network respectively. These learned vectors are the only trainable parameters during fine-tuning, and thus the original weights remain frozen. Dealing with learned vectors (as opposed to learned low-rank updates to a weight matrix like LoRA) keeps the number of trainable parameters much smaller.
adapter:
  type: ia3
  target_modules: null
  feedforward_modules: null
  fan_in_fan_out: false
  modules_to_save: null
  init_ia3_weights: true
  pretrained_adapter_weights: null
  postprocessor:
    merge_adapter_into_base_model: false
    progressbar: false
target_modules
(default:null
) : The names of the modules to apply (IA)^3 to.feedforward_modules
(default:null
) : The names of the modules to be treated as feedforward modules, as in the original paper. These modules will have (IA)^3 vectors multiplied to the input, instead of the output. feedforward_modules must be a name or a subset of names present in target_modules.fan_in_fan_out
(default:false
) : Set this to True if the layer to replace stores weight like (fan_in, fan_out). For example, gpt-2 uses Conv1D which stores weights like (fan_in, fan_out) and hence this should be set to True. Options:true
,false
.modules_to_save
(default:null
) : List of modules apart from (IA)^3 layers to be set as trainable and saved in the final checkpoint.init_ia3_weights
(default:true
) : Whether to initialize the vectors in the (IA)^3 layers, defaults to True. Options:true
,false
.pretrained_adapter_weights
(default:null
): Path to pretrained weights.postprocessor
:postprocessor.merge_adapter_into_base_model
(default:false
): Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single model (rather than having to load base and fine-tuned models separately). Options:true
,false
.postprocessor.progressbar
(default:false
): Instructs whether or not to show a progress bar indicating the unload and merge process. Options:true
,false
.
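For example, an IA3 adapter that explicitly names the attention and feed-forward projections to rescale could be configured as in the sketch below; the module names are assumptions that depend on the architecture of the base model being fine-tuned.
adapter:
  type: ia3
  target_modules: [k_proj, v_proj, down_proj]   # hypothetical module names
  feedforward_modules: [down_proj]              # must be a subset of target_modules
  init_ia3_weights: true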
More information about the adapter config can be found here.
Quantization¶
Attention
Quantized fine-tuning currently requires using adapter: lora
. In-context
learning does not have this restriction.
Attention
Quantization is currently only supported with backend: local
.
quantization:
  bits: 4
  llm_int8_threshold: 6.0
  llm_int8_has_fp16_weight: false
  bnb_4bit_compute_dtype: float16
  bnb_4bit_use_double_quant: true
  bnb_4bit_quant_type: nf4
bits
(default:4
) : The quantization level to apply to weights on load. Options:4
,8
.llm_int8_threshold
(default:6.0
): This corresponds to the outlier threshold for outlier detection as described in the LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
paper: https://arxiv.org/abs/2208.07339. Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).llm_int8_has_fp16_weight
(default:false
): This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not have to be converted back and forth for the backward pass. Options:true
,false
.bnb_4bit_compute_dtype
(default:float16
): This sets the computational type which might be different than the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups. Options:float32
,float16
,bfloat16
.bnb_4bit_use_double_quant
(default:true
): This flag is used for nested quantization where the quantization constants from the first quantization are quantized again. Options:true
,false
.bnb_4bit_quant_type
(default:nf4
): This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options:fp4
,nf4
.
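Combining this with the adapter section above, a 4-bit QLoRA-style fine-tuning setup on the local backend might look like the following sketch; the choice of bfloat16 compute assumes a GPU that supports it (otherwise float16 can be used).
backend:
  type: local
adapter:
  type: lora
quantization:
  bits: 4
  bnb_4bit_compute_dtype: bfloat16   # assumes bf16-capable hardware
  bnb_4bit_use_double_quant: true
  bnb_4bit_quant_type: nf4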
More information about quantization parameters can be found here.
Model Parameters¶
More information about the model initialization parameters can be found here.
Output Features¶
Text output features are a special case of Sequence Features, so all options of sequence features are available for text features as well.
Text output features can be used for either tagging (classifying each token of an input sequence) or text
generation (generating text by repeatedly sampling from the model). There are two decoders available for these tasks
named tagger
and generator
respectively.
Example text output feature using default parameters:
name: text_column_name
type: text
reduce_input: null
dependencies: []
reduce_dependencies: sum
loss:
  type: softmax_cross_entropy
  confidence_penalty: 0
  robust_lambda: 0
  class_weights: 1
  class_similarities_temperature: 0
decoder:
  type: generator
Parameters:
reduce_input
(default:sum
): defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are:sum
,mean
oravg
,max
,concat
(concatenates along the sequence dimension),last
(returns the last vector of the sequence dimension).dependencies
(default:[]
): the output features this one is dependent on. For a detailed explanation refer to Output Feature Dependencies.reduce_dependencies
(default:sum
): defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are:sum
,mean
oravg
,max
,concat
(concatenates along the sequence dimension),last
(returns the last vector of the sequence dimension).loss
(default:{type: softmax_cross_entropy, class_similarities_temperature: 0, class_weights: 1, confidence_penalty: 0, robust_lambda: 0}
): is a dictionary containing a loss type. The only available loss type for text features is softmax_cross_entropy
. See Loss for details.decoder
(default:{"type": "generator"}
): Decoder for the desired task. Options:generator
,tagger
. See Decoder for details.
Decoder type and decoder parameters can also be defined once and applied to all text output features using the Type-Global Decoder section. Loss and loss-related parameters can also be defined once in the same way.
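As a further sketch, a text output feature configured for tagging rather than generation might look like the following; reduce_input is left as null so the decoder receives the full sequence it requires (see the Tagger section below).
name: text_column_name
type: text
reduce_input: null   # the tagger needs the full b x s x h tensor, not a reduced vector
decoder:
  type: tagger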
Decoders¶
Generator¶
graph LR
A["Combiner Output"] --> B["Fully\n Connected\n Layers"];
B --> C1["RNN"] --> C2["RNN"] --> C3["RNN"];
GO(["GO"]) -.-o C1;
C1 -.-o O1("Output");
O1 -.-o C2;
C2 -.-o O2("Output");
O2 -.-o C3;
C3 -.-o END(["END"]);
subgraph DEC["DECODER.."]
B
C1
C2
C3
end
In the case of generator
the decoder is a (potentially empty) stack of fully connected layers, followed by an RNN that
generates outputs feeding on its own previous predictions and generates a tensor of size b x s' x c
, where b
is the
batch size, s'
is the length of the generated sequence and c
is the number of classes, followed by a
softmax_cross_entropy.
During training, teacher forcing is adopted: the list of targets is provided as both inputs and outputs (shifted by 1). At evaluation time, decoding is performed by beam search with a beam size of 1 by default, which amounts to greedy decoding (generating one token at a time and feeding it back as input for the next step).
In general a generator expects a b x h
shaped input tensor, where h
is a hidden dimension.
The h
vectors are (after an optional stack of fully connected layers) fed into the rnn generator.
One exception is when the generator uses attention, as in that case the expected size of the input tensor is
b x s x h
, which is the output of a sequence, text or time series input feature without reduced outputs or the output
of a sequence-based combiner.
If a b x h
input is provided to a generator decoder using an RNN with attention instead, an error will be raised
during model building.
decoder:
  type: generator
  num_fc_layers: 0
  fc_output_size: 256
  fc_norm: null
  fc_dropout: 0.0
  cell_type: gru
  num_layers: 1
  fc_activation: relu
  reduce_input: sum
  fc_layers: null
  fc_use_bias: true
  fc_weights_initializer: xavier_uniform
  fc_bias_initializer: zeros
  fc_norm_params: null
Parameters:
num_fc_layers
(default:0
) : Number of fully-connected layers if fc_layers
not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.fc_output_size
(default:256
) : Output size of fully connected stack.fc_norm
(default:null
) : Default normalization applied at the beginning of fully connected layers. Options:batch
,layer
,ghost
,null
.fc_dropout
(default:0.0
) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).cell_type
(default:gru
) : Type of recurrent cell to use. Options:rnn
,lstm
,gru
.num_layers
(default:1
) : The number of stacked recurrent layers.fc_activation
(default:relu
): Default activation function applied to the output of the fully connected layers. Options:elu
,leakyRelu
,logSigmoid
,relu
,sigmoid
,tanh
,softmax
,null
.reduce_input
(default:sum
): How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Options:sum
,mean
,avg
,max
,concat
,last
.fc_layers
(default:null
): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are:activation
,dropout
,norm
,norm_params
,output_size
,use_bias
,bias_initializer
and weights_initializer
. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.fc_use_bias
(default:true
): Whether the layer uses a bias vector in the fc_stack. Options:true
,false
.fc_weights_initializer
(default:xavier_uniform
): The weights initializer to use for the layers in the fc_stack.fc_bias_initializer
(default:zeros
): The bias initializer to use for the layers in the fc_stack.fc_norm_params
(default:null
): Default parameters passed to the norm
module.
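For instance, a generator decoder with a deeper recurrent stack and a small fully connected stack in front of it could be configured as in this sketch; the values are illustrative rather than recommended defaults.
decoder:
  type: generator
  cell_type: lstm     # rnn, lstm, or gru
  num_layers: 2       # two stacked recurrent layers
  num_fc_layers: 1
  fc_output_size: 512
  fc_dropout: 0.1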
Tagger¶
graph LR
A["emb[0]\n....\nemb[n]"] --> B["Fully\n Connected\n Layers"];
B --> C["Projection\n....\nProjection"];
C --> D["Softmax\n....\nSoftmax"];
subgraph DEC["DECODER.."]
B
C
D
end
subgraph COM["COMBINER OUT.."]
A
end
In the case of tagger
the decoder is a (potentially empty) stack of fully connected layers, followed by a projection
into a tensor of size b x s x c
, where b
is the batch size, s
is the length of the sequence and c
is the number
of classes, followed by a softmax_cross_entropy.
This decoder requires its input to be shaped as b x s x h
, where h
is a hidden dimension, which is the output of a
sequence, text or time series input feature without reduced outputs or the output of a sequence-based combiner.
If a b x h
input is provided instead, an error will be raised during model building.
decoder:
  type: tagger
  num_fc_layers: 0
  fc_output_size: 256
  fc_norm: null
  fc_dropout: 0.0
  fc_activation: relu
  attention_embedding_size: 256
  fc_layers: null
  fc_use_bias: true
  fc_weights_initializer: xavier_uniform
  fc_bias_initializer: zeros
  fc_norm_params: null
  use_attention: false
  use_bias: true
  attention_num_heads: 8
Parameters:
num_fc_layers
(default:0
) : Number of fully-connected layers if fc_layers
not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.fc_output_size
(default:256
) : Output size of fully connected stack.fc_norm
(default:null
) : Default normalization applied at the beginning of fully connected layers. Options:batch
,layer
,ghost
,null
.fc_dropout
(default:0.0
) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).fc_activation
(default:relu
): Default activation function applied to the output of the fully connected layers. Options:elu
,leakyRelu
,logSigmoid
,relu
,sigmoid
,tanh
,softmax
,null
.attention_embedding_size
(default:256
): The embedding size of the multi-head self attention layer.fc_layers
(default:null
): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are:activation
,dropout
,norm
,norm_params
,output_size
,use_bias
,bias_initializer
and weights_initializer
. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.fc_use_bias
(default:true
): Whether the layer uses a bias vector in the fc_stack. Options:true
,false
.fc_weights_initializer
(default:xavier_uniform
): The weights initializer to use for the layers in the fc_stack.fc_bias_initializer
(default:zeros
): The bias initializer to use for the layers in the fc_stack.fc_norm_params
(default:null
): Default parameters passed to the norm
module.use_attention
(default:false
): Whether to apply a multi-head self attention layer before prediction. Options:true
,false
.use_bias
(default:true
): Whether the layer uses a bias vector. Options:true
,false
.attention_num_heads
(default:8
): The number of attention heads in the multi-head self attention layer.
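For example, a tagger decoder that enables the optional multi-head self attention layer before the projection might be configured as in the sketch below; the head count and embedding size are illustrative.
decoder:
  type: tagger
  use_attention: true
  attention_num_heads: 4
  attention_embedding_size: 128
  fc_dropout: 0.1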
Loss¶
Sequence Softmax Cross Entropy¶
loss:
  type: sequence_softmax_cross_entropy
  class_weights: null
  weight: 1.0
  robust_lambda: 0
  confidence_penalty: 0
  class_similarities: null
  class_similarities_temperature: 0
  unique: false
Parameters:
class_weights
(default:null
): Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied by the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the <UNK> class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like {class_a: 0.5, class_b: 0.7, ...}
.weight
(default:1.0
): Weight of the loss.robust_lambda
(default:0
): Replaces the loss with (1 - robust_lambda) * loss + robust_lambda / c, where c
is the number of classes. Useful in case of noisy labels.confidence_penalty
(default:0
): Penalizes overconfident predictions (low entropy) by adding an a * (max_entropy - entropy) / max_entropy
term to the loss, where a is the value of this parameter. Useful in case of noisy labels.class_similarities
(default:null
): If not null, it is a c x c matrix in the form of a list of lists that contains the mutual similarity of classes. It is used if class_similarities_temperature
is greater than 0. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the <UNK>
class needs to be included too).class_similarities_temperature
(default:0
): The temperature parameter of the softmax that is performed on each row of class_similarities
. The output of that softmax is used to determine the supervision vector to provide instead of the one hot vector that would be provided otherwise for each datapoint. The intuition behind it is that errors between similar classes are more tolerable than errors between really different classes.unique
(default:false
): If true, the loss is only computed for unique elements in the sequence. Options:true
,false
.
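As an example, a loss that up-weights a rare class and softens noisy labels slightly might look like the sketch below, where class_a and class_b are hypothetical class names from the feature's vocabulary.
loss:
  type: sequence_softmax_cross_entropy
  class_weights: {class_a: 2.0, class_b: 1.0}   # hypothetical class names and weights
  robust_lambda: 0.1        # mixes in a small uniform term to soften noisy labels
  confidence_penalty: 0.1   # penalizes overconfident (low-entropy) predictions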
Metrics¶
The metrics available for text features are the same as for Sequence Features:
sequence_accuracy
The rate at which the model predicted the correct sequence.token_accuracy
The number of tokens correctly predicted divided by the total number of tokens in all sequences.last_accuracy
Accuracy considering only the last element of the sequence. Useful to ensure special end-of-sequence tokens are generated or tagged.edit_distance
Levenshtein distance: the minimum number of single-token edits (insertions, deletions or substitutions) required to change the predicted sequence into the ground truth.perplexity
Perplexity is the inverse of the predicted probability of the ground truth sequence, normalized by the number of tokens. The lower the perplexity, the higher the probability of predicting the true sequence.loss
The value of the loss function.
You can set any of the above as validation_metric
in the training
section of the configuration if validation_field
names a sequence feature.
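For example, to select model checkpoints by token accuracy on a text output feature, the configuration could include something like the sketch below; this assumes the trainer section (referred to as the training section above) and a text output feature named text_column_name.
trainer:
  validation_field: text_column_name
  validation_metric: token_accuracy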