Combiner
Combiners take the outputs of all input feature encoders and combine them before providing the combined representation to the output feature decoders.
You can specify which one to use in the combiner section of the configuration. If you don't specify a combiner, the concat combiner will be used.
Combiner Types
Concat Combiner
graph LR
I1[Encoder Output 1] --> C[Concat];
IK[...] --> C;
IN[Encoder Output N] --> C;
C --> FC[Fully Connected Layers];
FC --> ...;
subgraph COMBINER..
C
FC
end
The concat combiner assumes all outputs from encoders are tensors of size b x h, where b is the batch size and h is the hidden dimension, which can be different for each input.
If any inputs have more than 2 dimensions (a sequence or set feature, for example), set the flatten_inputs parameter to true.
It concatenates along the h dimension, and then (optionally) passes the concatenated tensor through a stack of fully connected layers.
It returns the final b x h' tensor, where h' is the size of the last fully connected layer or, if there are no fully connected layers, the sum of the h sizes of all inputs.
If there is only a single input feature and no fully connected layers are specified, the output of the input feature encoder is passed through the combiner unchanged.
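The shape arithmetic described above can be sketched in plain Python. This is only an illustration using nested lists in place of tensors; the function name concat_combiner is hypothetical and is not part of the library, and the optional fully connected layers are left out.

```python
def concat_combiner(encoder_outputs):
    """Concatenate per-feature outputs of shape b x h_i along the hidden dimension.

    encoder_outputs: list of 2-D nested lists, one per input feature,
    each with b rows. Returns a b x sum(h_i) nested list.
    """
    batch_size = len(encoder_outputs[0])
    if any(len(out) != batch_size for out in encoder_outputs):
        raise ValueError("all encoder outputs must share the batch dimension b")
    combined = []
    for row_idx in range(batch_size):
        row = []
        for out in encoder_outputs:
            # Append this feature's hidden vector for the current example.
            row.extend(out[row_idx])
        combined.append(row)
    return combined


# Two features, batch size 2, hidden sizes 3 and 2.
feature_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
feature_b = [[0.1, 0.2], [0.3, 0.4]]
combined = concat_combiner([feature_a, feature_b])
print(len(combined), len(combined[0]))  # 2 5
```

With no fully connected layers, h' is 3 + 2 = 5 here, and a single input comes back unchanged.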
combiner:
    type: concat
    dropout: 0.0
    num_fc_layers: 0
    output_size: 256
    norm: null
    activation: relu
    flatten_inputs: false
    residual: false
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null
Parameters:
- dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- num_fc_layers (default: 0): Number of stacked fully connected layers to apply. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- output_size (default: 256): Output size of a fully connected layer.
- norm (default: null): Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
- activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- flatten_inputs (default: false): Whether to flatten input tensors to a vector. Options: true, false.
- residual (default: false): Whether to add a residual connection to each fully connected layer block. Requires all fully connected layers to have the same output_size. Options: true, false.
- use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.
- fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
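The fallback rule for fc_layers can be illustrated with a small sketch. The helper resolve_fc_layers is hypothetical and written only for this example; in particular, building num_fc_layers identical layers when fc_layers is null is an assumption about how the two parameters interact, not a description of the library's code.

```python
PER_LAYER_KEYS = [
    "activation", "dropout", "norm", "norm_params",
    "output_size", "use_bias", "bias_initializer", "weights_initializer",
]


def resolve_fc_layers(combiner_config):
    """Hypothetical helper: fill each fc_layers entry with the standalone defaults."""
    defaults = {key: combiner_config.get(key) for key in PER_LAYER_KEYS}
    fc_layers = combiner_config.get("fc_layers")
    if fc_layers is None:
        # Assumption: with no explicit list, use num_fc_layers layers built from the defaults.
        fc_layers = [{} for _ in range(combiner_config.get("num_fc_layers", 0))]
    # Values given in a layer dictionary override the standalone defaults.
    return [{**defaults, **layer} for layer in fc_layers]


config = {
    "type": "concat",
    "dropout": 0.0,
    "output_size": 256,
    "activation": "relu",
    "fc_layers": [{"output_size": 128, "dropout": 0.1}, {"output_size": 64}],
}
layers = resolve_fc_layers(config)
print([layer["output_size"] for layer in layers])  # [128, 64]
```

The second layer does not set dropout, so it inherits the standalone value 0.0, while the first layer keeps its own 0.1.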
Sequence Concat Combiner
graph LR
I1 --> X1{Tile};
IK --> X1;
IN --> X1;
IO --> X1;
X1 --> SC1;
X1 --> SCK;
X1 --> SCN;
SC1 --> R[Reduce];
SCK --> R;
SCN --> R;
R --> ...;
subgraph CONCAT["TENSOR.."]
direction TB
SC1["emb seq 1  emb oth" ];
SCK[...];
SCN["emb seq n  emb oth"];
end
subgraph COMBINER..
X1
CONCAT
R
end
subgraph SF[SEQUENCE FEATS..]
direction TB
I1["emb seq 1" ];
IK[...];
IN["emb seq n"];
end
subgraph OF[OTHER FEATS..]
direction TB
IO["emb oth"]
end
The sequence_concat combiner assumes at least one output from encoders is a tensor of size b x s x h, where b is the batch size, s is the length of the sequence and h is the hidden dimension.
A sequence-like (sequence, text or time series) input feature can be specified with the main_sequence_feature parameter, which takes the name of a sequence-like input feature as its value.
If no main_sequence_feature is specified, the combiner will look through all the features in the order they are defined in the configuration and will look for a feature with a rank 3 tensor output (sequence, text or time series).
If it cannot find one it will raise an exception; otherwise the output of that feature will be used for concatenating the other features along the sequence s dimension.
If there are other input features with a rank 3 output tensor, the combiner will concatenate them alongside the s dimension, which means that all of them must have an identical s dimension; otherwise a dimension mismatch error will be thrown during training when a datapoint with two sequential features of different lengths is provided.
Other features that have a b x h rank 2 tensor output will be replicated s times along the s dimension and concatenated with the sequence features at every position.
The final output is a b x s x h' tensor, where h' is the size of the concatenation of the h dimensions of all input features.
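A minimal sketch of the tile-and-concatenate step, using nested lists in place of tensors. The function name sequence_concat and its argument names are hypothetical, and the optional reduction is omitted; the only behaviour taken from the text is that rank 2 outputs are repeated s times and that the result has shape b x s x h'.

```python
def sequence_concat(main_sequence, other_sequences=(), vector_features=()):
    """Combine one b x s x h sequence with other sequences and b x h vectors.

    other_sequences: list of b x s x h_i nested lists (same s as main_sequence).
    vector_features: list of b x h_j nested lists, tiled across the s positions.
    Returns a b x s x h' nested list, with h' the sum of all hidden sizes.
    """
    seq_len = len(main_sequence[0])
    for seq in other_sequences:
        if any(len(example) != seq_len for example in seq):
            raise ValueError("all sequence features must share the same s dimension")
    combined = []
    for b, example in enumerate(main_sequence):
        steps = []
        for t, hidden in enumerate(example):
            step = list(hidden)
            for seq in other_sequences:
                step.extend(seq[b][t])
            for vec in vector_features:
                # The same b x h vector is reused at every position (tiling).
                step.extend(vec[b])
            steps.append(step)
        combined.append(steps)
    return combined


text_output = [[[1.0, 2.0], [3.0, 4.0]]]   # b=1, s=2, h=2
category_output = [[9.0]]                   # b=1, h=1
print(sequence_concat(text_output, vector_features=[category_output]))
# [[[1.0, 2.0, 9.0], [3.0, 4.0, 9.0]]]
```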
combiner:
    type: sequence_concat
    main_sequence_feature: null
    reduce_output: null
Parameters:
- main_sequence_feature (default: null): Name of a sequence, text, or time series feature to concatenate the outputs of the other features to. If no main_sequence_feature is specified, the combiner will look through all the features in the order they are defined in the configuration and will look for a feature with a rank 3 tensor output (sequence, text or time series). If it cannot find one it will raise an exception; otherwise the output of that feature will be used for concatenating the other features along the sequence s dimension. If there are other input features with a rank 3 output tensor, the combiner will concatenate them alongside the s dimension. All sequence-like input features must have an identical s dimension, otherwise an error will be thrown.
- reduce_output (default: null): Strategy to use to aggregate the embeddings along the sequence dimension. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
Sequence Combiner
graph LR
I1 --> X1{Tile};
IK --> X1;
IN --> X1;
IO --> X1;
X1 --> SC1;
X1 --> SCK;
X1 --> SCN;
SC1 --> R["Sequence Encoder"];
SCK --> R;
SCN --> R;
R --> ...;
subgraph CONCAT["TENSOR.."]
direction TB
SC1["emb seq 1  emb oth" ];
SCK[...];
SCN["emb seq n  emb oth"];
end
subgraph COMBINER..
X1
CONCAT
R
end
subgraph SF[SEQUENCE FEATS..]
direction TB
I1["emb seq 1" ];
IK[...];
IN["emb seq n"];
end
subgraph OF[OTHER FEATS..]
direction TB
IO["emb oth"]
end
The sequence combiner stacks a sequence concat combiner with a sequence encoder.
All the considerations about input tensor ranks described for the sequence concat combiner also apply in this case. The main difference is that this combiner uses the b x s x h' output of the sequence concat combiner, where b is the batch size, s is the sequence length and h' is the sum of the hidden dimensions of all input features, as input for any of the sequence encoders described in the sequence features encoders section.
Refer to that section for more detailed information about the sequence encoders and their parameters.
All considerations on the shape of the outputs for the sequence encoders also apply to the sequence combiner.
combiner:
    type: sequence
    main_sequence_feature: null
    reduce_output: null
    encoder:
        type: parallel_cnn
        skip: false
        dropout: 0.0
        activation: relu
        max_sequence_length: null
        representation: dense
        vocab: null
        use_bias: true
        bias_initializer: zeros
        weights_initializer: xavier_uniform
        should_embed: true
        embedding_size: 256
        embeddings_on_cpu: false
        embeddings_trainable: true
        pretrained_embeddings: null
        reduce_output: sum
        num_conv_layers: null
        conv_layers: null
        num_filters: 256
        filter_size: 3
        pool_function: max
        pool_size: null
        output_size: 256
        norm: null
        norm_params: null
        num_fc_layers: null
        fc_layers: null
Parameters:
- main_sequence_feature (default: null): Name of a sequence, text, or time series feature to concatenate the outputs of the other features to. If no main_sequence_feature is specified, the combiner will look through all the features in the order they are defined in the configuration and will look for a feature with a rank 3 tensor output (sequence, text or time series). If it cannot find one it will raise an exception; otherwise the output of that feature will be used for concatenating the other features along the sequence s dimension. If there are other input features with a rank 3 output tensor, the combiner will concatenate them alongside the s dimension. All sequence-like input features must have an identical s dimension, otherwise an error will be thrown.
- reduce_output (default: null): Strategy to use to aggregate the embeddings along the sequence dimension. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
- encoder (default: {"type": "parallel_cnn"}): Encoder to apply to main_sequence_feature. The encoder must produce a tensor of size [batch_size, sequence_length, hidden_size].
TabNet Combiner
graph LR
I1[Encoder Output 1] --> C[TabNet];
IK[...] --> C;
IN[Encoder Output N] --> C;
C --> ...;
The tabnet combiner implements the TabNet model, which uses attention and sparsity to achieve high performance on tabular data. It assumes all outputs from encoders are tensors of size b x h, where b is the batch size and h is the hidden dimension, which can be different for each input.
If the input tensors have a different shape, it automatically flattens them.
It returns the final b x h' tensor, where h' is the user-specified output size.
combiner:
    type: tabnet
    size: 32
    dropout: 0.05
    output_size: 128
    num_steps: 3
    num_total_blocks: 4
    num_shared_blocks: 2
    relaxation_factor: 1.5
    bn_epsilon: 0.001
    bn_momentum: 0.05
    bn_virtual_bs: 1024
    sparsity: 0.0001
    entmax_mode: sparsemax
    entmax_alpha: 1.5
Parameters:
- size (default: 32): Size of the hidden layers. N_a in (Arik and Pfister, 2019).
- dropout (default: 0.05): Dropout rate for the transformer block.
- output_size (default: 128): Output size of a fully connected layer. N_d in (Arik and Pfister, 2019).
- num_steps (default: 3): Number of steps / repetitions of the attentive transformer and feature transformer computations. N_steps in (Arik and Pfister, 2019).
- num_total_blocks (default: 4): Total number of feature transformer blocks at each step.
- num_shared_blocks (default: 2): Number of shared feature transformer blocks across the steps.
- relaxation_factor (default: 1.5): Factor that influences how many times a feature should be used across the steps of computation. A value of 1 implies that each feature should be used once; a higher value allows for multiple usages. gamma in (Arik and Pfister, 2019).
- bn_epsilon (default: 0.001): Epsilon to be added to the batch norm denominator.
- bn_momentum (default: 0.05): Momentum of the batch norm. 1 - m_B from the TabNet paper.
- bn_virtual_bs (default: 1024): Size of the virtual batch size used by ghost batch norm. If null, regular batch norm is used instead. B_v from the TabNet paper. See Ghost Batch Normalization for details.
- sparsity (default: 0.0001): Multiplier of the sparsity inducing loss. lambda_sparse in (Arik and Pfister, 2019).
- entmax_mode (default: sparsemax): Entmax is a sparse family of probability mappings which generalizes softmax and sparsemax. entmax_mode controls the sparsity. Options: entmax15, sparsemax, constant, adaptive.
- entmax_alpha (default: 1.5): Must be a number between 1.0 and 2.0. If entmax_mode is adaptive, entmax_alpha is used as the initial value for the learnable parameter. 1 corresponds to softmax, 2 is sparsemax.
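To make the two ends of the entmax family concrete, here is a small standard-library sketch comparing softmax with sparsemax on one score vector. It follows the usual sort-and-threshold formulation of sparsemax and is not taken from the library's implementation; the function names are chosen for this example only.

```python
import math


def softmax(scores):
    """Dense mapping: every entry receives a strictly positive probability."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Sparse mapping: project the scores onto the probability simplex."""
    sorted_scores = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(sorted_scores, start=1):
        cumulative += value
        # Entries satisfying this condition belong to the support; the last
        # one that does determines the threshold tau.
        if 1.0 + k * value > cumulative:
            tau = (cumulative - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]


scores = [2.0, 1.0, 0.1]
print(sparsemax(scores))  # [1.0, 0.0, 0.0]
print(all(p > 0.0 for p in softmax(scores)))  # True
```

Sparsemax assigns exactly zero to low-scoring entries, which is what lets the attentive transformer select a sparse subset of features, while softmax always spreads some mass over every entry.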
Transformer Combiner
graph LR
I1[Encoder Output 1] --> C["Transformer Stack"];
IK[...] --> C;
IN[Encoder Output N] --> C;
C --> R[Reduce];
R --> FC[Fully Connected Layers];
FC --> ...;
subgraph COMBINER..
C
R
FC
end
The transformer combiner combines input features using a stack of Transformer blocks (from Attention Is All You Need).
It assumes all outputs from encoders are tensors of size b x h, where b is the batch size and h is the hidden dimension, which can be different for each input.
If the input tensors have a different shape, it automatically flattens them.
It then projects each input tensor to the same hidden / embedding size and encodes them with a stack of Transformer layers.
Finally, the transformer combiner applies a reduction to the outputs of the Transformer stack, followed by optional fully connected layers.
The output is a b x h' tensor, where h' is the size of the last fully connected layer or the hidden / embedding size. If no reduction is applied, the output is a b x n x h' tensor, where n is the number of input features and h' is the hidden / embedding size.
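The effect of the reduction on the output shape can be sketched for a single example, whose Transformer output is an n x h' list of vectors. The function below is a simplified illustration, not the library's reducer: the name reduce_output mirrors the parameter, the attention option is omitted because it needs learned weights, and last simply takes the final vector.

```python
def reduce_output(hidden, mode):
    """Reduce an n x h' nested list (one example) according to mode."""
    if mode is None or mode == "none":
        return hidden                      # stays n x h'
    if mode == "last":
        return list(hidden[-1])            # h'
    if mode == "concat":
        return [value for vector in hidden for value in vector]  # n * h'
    columns = list(zip(*hidden))
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode in ("mean", "avg"):
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unsupported reduction in this sketch: {mode}")


hidden = [[1.0, 2.0], [3.0, 6.0]]          # n=2 features, h'=2
print(reduce_output(hidden, "mean"))       # [2.0, 4.0]
print(reduce_output(hidden, "concat"))     # [1.0, 2.0, 3.0, 6.0]
```

Note that concat yields a vector of size n * h' rather than h', so the size seen by any following fully connected layers depends on the chosen reduction.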
Resources to learn more about transformers:
- CS480/680 Lecture 19: Attention and Transformer Networks (VIDEO)
- Attention is all you need - Attentional Neural Network Models Masterclass (VIDEO)
- Illustrated: Self-Attention (Colab notebook)
combiner:
    type: transformer
    dropout: 0.1
    num_fc_layers: 0
    output_size: 256
    norm: null
    fc_dropout: 0.0
    transformer_output_size: 256
    hidden_size: 256
    num_layers: 1
    num_heads: 8
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null
    fc_activation: relu
    fc_residual: false
    reduce_output: mean
Parameters:
- dropout (default: 0.1): Dropout rate for the transformer block.
- num_fc_layers (default: 0): The number of stacked fully connected layers (only applies if reduce_output is not null).
- output_size (default: 256): Output size of a fully connected layer.
- norm (default: null): Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
- fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- transformer_output_size (default: 256): Size of the fully connected layer after self attention in the transformer block. This is usually the same as hidden_size and embedding_size.
- hidden_size (default: 256): The number of hidden units of the TransformerStack as well as the dimension that each incoming input feature is projected to before feeding to the TransformerStack.
- num_layers (default: 1): The number of transformer layers.
- num_heads (default: 8): Number of heads of the self attention in the transformer block.
- use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.
- fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- fc_residual (default: false): Whether to add a residual connection to each fully connected layer block. Requires all fully connected layers to have the same output_size. Options: true, false.
- reduce_output (default: mean): Strategy to use to aggregate the output of the transformer. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
TabTransformer Combiner
graph LR
I1[Cat Emb 1] --> T1["Concat"];
IK[...] --> T1;
IN[Cat Emb N] --> T1;
N1[Number ...] --> T4["Concat"];
B1[Binary ...] --> T4;
T1 --> T2["Transformer"];
T2 --> T3["Reduce"];
T3 --> T4;
T4 --> T5["FC Layers"];
T5 --> ...;
subgraph COMBINER..
CAT
T4
T5
end
subgraph ENCODER OUT..
I1
IK
IN
N1
B1
end
subgraph CAT["CATEGORY PIPELINE.."]
direction TB
T1
T2
T3
end
The tabtransformer combiner combines input features in the following sequence of operations. The combiner projects all encoder outputs except binary and number features into an embedding space. These features are concatenated as if they were a sequence and passed through a transformer. After the transformer, the number and binary features (which are of size 1) are concatenated with each other and then with the output of the transformer, and the result is passed to a stack of fully connected layers (from TabTransformer: Tabular Data Modeling Using Contextual Embeddings).
It assumes all outputs from encoders are tensors of size b x h, where b is the batch size and h is the hidden dimension, which can be different for each input.
If the input tensors have a different shape, it automatically flattens them.
It then projects each input tensor to the same hidden / embedding size and encodes them with a stack of Transformer layers.
Finally, the tabtransformer combiner applies a reduction to the outputs of the Transformer stack, followed by the above concatenation and optional fully connected layers.
The output is a b x h' tensor, where h' is the size of the last fully connected layer or the hidden / embedding size. If no reduction is applied, the output is a b x n x h' tensor, where n is the number of input features and h' is the hidden / embedding size.
combiner:
    type: tabtransformer
    dropout: 0.1
    num_fc_layers: 0
    output_size: 256
    norm: null
    fc_dropout: 0.0
    embed_input_feature_name: null
    transformer_output_size: 256
    hidden_size: 256
    num_layers: 1
    num_heads: 8
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null
    fc_activation: relu
    fc_residual: false
    reduce_output: concat
Parameters:
- dropout (default: 0.1): Dropout rate for the transformer block.
- num_fc_layers (default: 0): The number of stacked fully connected layers (only applies if reduce_output is not null).
- output_size (default: 256): Output size of a fully connected layer.
- norm (default: null): Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
- fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- embed_input_feature_name (default: null): This value controls the size of the embeddings. Valid values are add, which uses the hidden_size value, or an integer that sets the embedding size to a specific value. In the case of an integer value, it must be smaller than hidden_size.
- transformer_output_size (default: 256): Size of the fully connected layer after self attention in the transformer block. This is usually the same as hidden_size and embedding_size.
- hidden_size (default: 256): The number of hidden units of the TransformerStack as well as the dimension that each incoming input feature is projected to before feeding to the TransformerStack.
- num_layers (default: 1): The number of transformer layers.
- num_heads (default: 8): Number of heads of the self attention in the transformer block.
- use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.
- fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- fc_residual (default: false): Whether to add a residual connection to each fully connected layer block. Requires all fully connected layers to have the same output_size. Options: true, false.
- reduce_output (default: concat): Strategy to use to aggregate the output of the transformer. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
Comparator Combiner
graph LR
I1[Entity 1 Embed 1] --> C1[Concat];
IK[...] --> C1;
IN[Entity 1 Embed N] --> C1;
C1 --> FC1[FC Layers];
FC1 --> COMP[Compare];
I2[Entity 2 Embed 1] --> C2[Concat];
IK2[...] --> C2;
IN2[Entity 2 Embed N] --> C2;
C2 --> FC2[FC Layers];
FC2 --> COMP;
COMP --> ...;
subgraph ENTITY1["ENTITY 1.."]
I1
IK
IN
end
subgraph ENTITY2["ENTITY 2.."]
I2
IK2
IN2
end
subgraph COMBINER..
C1
FC1
C2
FC2
COMP
end
The comparator combiner compares the hidden representation of two entities defined by lists of features.
It assumes all outputs from encoders are tensors of size b x h, where b is the batch size and h is the hidden dimension, which can be different for each input.
If the input tensors have a different shape, it automatically flattens them.
It then concatenates the representations of each entity and projects them both to vectors of size output_size.
Finally, it compares the two entity representations by dot product, element-wise multiplication, absolute difference and bilinear product.
It returns the final b x h' tensor, where h' is the size of the concatenation of the four comparisons.
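The four comparisons can be sketched for a single pair of projected entity vectors. This is an illustration only: the function name compare_entities, the order in which the results are concatenated, and the explicit bilinear_weights matrix (which would be learned in practice) are all assumptions made for the example.

```python
def compare_entities(entity_1, entity_2, bilinear_weights):
    """Compare two vectors of equal size h and concatenate the four results.

    bilinear_weights: h x h nested list standing in for a learned matrix W,
    so that the bilinear product is entity_1^T W entity_2.
    """
    dot_product = sum(a * b for a, b in zip(entity_1, entity_2))
    elementwise_product = [a * b for a, b in zip(entity_1, entity_2)]
    absolute_difference = [abs(a - b) for a, b in zip(entity_1, entity_2)]
    bilinear_product = sum(
        entity_1[i] * bilinear_weights[i][j] * entity_2[j]
        for i in range(len(entity_1))
        for j in range(len(entity_2))
    )
    # Assumed order: scalar, vector, vector, scalar.
    return [dot_product] + elementwise_product + absolute_difference + [bilinear_product]


identity = [[1.0, 0.0], [0.0, 1.0]]
comparison = compare_entities([1.0, 2.0], [3.0, 4.0], identity)
print(comparison)  # [11.0, 3.0, 8.0, 2.0, 2.0, 11.0]
```

Under these assumptions the two scalar comparisons contribute one value each and the two element-wise comparisons contribute h values each, so the concatenation has size 2h + 2 (6 for h = 2 above).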
combiner:
    type: comparator
    entity_1:
        - feature_1
        - feature_2
    entity_2:
        - feature_3
    dropout: 0.0
    num_fc_layers: 1
    output_size: 256
    norm: null
    activation: relu
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null
Parameters:
- entity_1 (default: null): The list of input feature names [feature_1, feature_2, ...] constituting the first entity to compare. Required.
- entity_2 (default: null): The list of input feature names [feature_1, feature_2, ...] constituting the second entity to compare. Required.
- dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- num_fc_layers (default: 1): Number of stacked fully connected layers to apply. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- output_size (default: 256): Output size of a fully connected layer.
- norm (default: null): Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
- activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
- use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
- bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.
- fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
Common Parameters
These parameters are used across multiple combiners (and some encoders / decoders) in similar ways.
Normalization
Normalization applied at the beginning of the fully connected stack. If a norm is not already specified for the fc_layers, this is the default norm that will be used for each layer. One of:
- null: no normalization
- batch: batch normalization
- layer: layer normalization
- ghost: ghost batch normalization
Batch Normalization
Applies Batch Normalization as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. See PyTorch documentation on batch normalization for more details.
norm: batch
norm_params:
    eps: 0.001
    momentum: 0.1
    affine: true
    track_running_stats: true
Parameters:
- eps (default: 0.001): Epsilon to be added to the batch norm denominator.
- momentum (default: 0.1): The value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average).
- affine (default: true): A boolean value that, when set to true, gives this module learnable affine parameters.
- track_running_stats (default: true): A boolean value that, when set to true, makes this module track the running mean and variance. When set to false, this module does not track such statistics and initializes the statistics buffers running_mean and running_var as null. When these buffers are null, this module always uses batch statistics in both training and eval modes.
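As a brief illustration of what momentum does, the sketch below applies the exponential update that PyTorch documents for the running statistics, new = (1 - momentum) * running + momentum * observed. The function name update_running_stat is chosen for this example.

```python
def update_running_stat(running, batch_stat, momentum=0.1):
    """One update of a running mean or variance from the current batch statistic."""
    return (1.0 - momentum) * running + momentum * batch_stat


running_mean = 0.0
for batch_mean in [10.0, 10.0, 10.0]:
    running_mean = update_running_stat(running_mean, batch_mean, momentum=0.1)
print(round(running_mean, 2))  # 2.71
```

A small momentum therefore makes the running statistics change slowly, while a larger one lets them follow recent batches more closely.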
Layer Normalization
Applies Layer Normalization over a minibatch of inputs as described in the paper Layer Normalization. See PyTorch documentation on layer normalization for more details.
norm: layer
norm_params:
    eps: 0.00001
    elementwise_affine: true
Parameters:
- eps (default: 0.00001): A value added to the denominator for numerical stability.
- elementwise_affine (default: true): A boolean value that, when set to true, gives this module learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases).
Ghost Batch Normalization
Ghost Batch Norm is a technique designed to address the "generalization gap" whereby the training process breaks down with very large batch sizes. If you are using a large batch size (typically in the thousands) to maximize GPU utilization, but the model is not converging well, enabling ghost batch norm can be a useful technique to improve convergence.
When using ghost batch norm, you specify a virtual_batch_size (default 128) representing the "ideal" batch size to train with (ignoring throughput or GPU utilization). The ghost batch norm will then subdivide each batch into sub-batches of size virtual_batch_size and apply batch normalization to each.
A notable downside to ghost batch norm is that it is more computationally expensive than traditional batch norm, so it is only recommended to use it when the batch size that maximizes throughput is significantly higher than the batch size that yields the best convergence (one or more orders of magnitude higher).
The approach was introduced in Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks and since popularized by its use in TabNet.
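The subdivision step can be sketched with nested lists standing in for a b x h batch. This is a simplified training-time illustration without learnable affine parameters or running statistics; the function names are chosen for the example and do not refer to the library's implementation.

```python
import math


def batch_norm(rows, epsilon=0.001):
    """Normalize each column of a b x h nested list using its own mean and variance."""
    count = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / count for col in columns]
    variances = [
        sum((value - mean) ** 2 for value in col) / count
        for col, mean in zip(columns, means)
    ]
    return [
        [(value - mean) / math.sqrt(var + epsilon)
         for value, mean, var in zip(row, means, variances)]
        for row in rows
    ]


def ghost_batch_norm(rows, virtual_batch_size=128, epsilon=0.001):
    """Split the batch into chunks of virtual_batch_size and normalize each separately."""
    normalized = []
    for start in range(0, len(rows), virtual_batch_size):
        chunk = rows[start:start + virtual_batch_size]
        normalized.extend(batch_norm(chunk, epsilon))
    return normalized


batch = [[1.0], [3.0], [10.0], [30.0]]
ghost = ghost_batch_norm(batch, virtual_batch_size=2)
print(len(ghost))  # 4
```

Each chunk of two rows is centred on its own mean, so the first pair and the second pair are normalized independently even though their scales differ by an order of magnitude.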
norm: ghost
norm_params:
    virtual_batch_size: 128
    epsilon: 0.001
    momentum: 0.05
Parameters:
- virtual_batch_size (default: 128): Size of the virtual batch size used by ghost batch norm. If null, regular batch norm is used instead. B_v from the TabNet paper.
- epsilon (default: 0.001): Epsilon to be added to the batch norm denominator.
- momentum (default: 0.05): Momentum of the batch norm. 1 - m_B from the TabNet paper.