Combiner

Combiners take the outputs of all input feature encoders and combine them before providing the combined representation to the output feature decoders.

You can specify which combiner to use in the combiner section of the configuration. If you don't specify one, the concat combiner is used.

Combiner Types

Concat Combiner

graph LR
  I1[Encoder Output 1] --> C[Concat];
  IK[...] --> C;
  IN[Encoder Output N] --> C;
  C --> FC[Fully Connected Layers];
  FC --> ...;
  subgraph COMBINER..
  C
  FC
  end

The concat combiner assumes all outputs from encoders are tensors of size b x h where b is the batch size and h is the hidden dimension, which can be different for each input. If any input has more than 2 dimensions (a sequence or set feature, for example), set the flatten_inputs parameter to true. The combiner concatenates along the h dimension, and then (optionally) passes the concatenated tensor through a stack of fully connected layers. It returns the final b x h' tensor where h' is the size of the last fully connected layer, or the sum of the h sizes of all inputs when there are no fully connected layers. If there is only a single input feature and no fully connected layers are specified, the output of the input feature encoder is passed through the combiner unchanged.
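As a rough illustration of the shape arithmetic described above, the sketch below concatenates per-feature encoder outputs along the hidden dimension using plain Python lists. It is not Ludwig's implementation; the helper name `concat_encoder_outputs` and the sample values are hypothetical, and the optional fully connected layers are left out.

```python
from typing import List

Matrix = List[List[float]]  # a b x h tensor written as nested lists


def concat_encoder_outputs(outputs: List[Matrix]) -> Matrix:
    """Concatenate b x h_i matrices along the hidden dimension."""
    if not outputs:
        raise ValueError("at least one encoder output is required")
    batch_size = len(outputs[0])
    if any(len(o) != batch_size for o in outputs):
        raise ValueError("all encoder outputs must share the batch size b")
    # Row r of the result is row r of every input, joined end to end.
    return [[v for o in outputs for v in o[r]] for r in range(batch_size)]


# Two features with hidden sizes 2 and 3, batch size 2.
feature_a = [[1.0, 2.0], [3.0, 4.0]]
feature_b = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
combined = concat_encoder_outputs([feature_a, feature_b])
print(len(combined), len(combined[0]))  # b = 2, h' = 2 + 3 = 5
```

With a single input and no fully connected layers, the same function returns that input's rows unchanged, which matches the pass-through case described above.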

combiner:
    type: concat
    dropout: 0.0
    num_fc_layers: 0
    output_size: 256
    norm: null
    activation: relu
    flatten_inputs: false
    residual: false
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null

Parameters:

dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
num_fc_layers (default: 0) : Number of stacked fully connected layers to apply. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
output_size (default: 256) : Output size of a fully connected layer.
norm (default: null) : Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
flatten_inputs (default: false): Whether to flatten input tensors to a vector. Options: true, false.
residual (default: false): Whether to add a residual connection to each fully connected layer block. Requires all fully connected layers to have the same output_size. Options: true, false.
use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.
fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
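The fallback rule for fc_layers can be sketched as a dictionary merge. This is only an illustration of the behavior described above, not Ludwig's code: the helper `resolve_fc_layers` is hypothetical, and the handling of num_fc_layers when fc_layers is null is an assumption.

```python
from typing import Any, Dict, List, Optional

# Keys that the parameter description lists as configurable per layer.
PER_LAYER_KEYS = (
    "activation", "dropout", "norm", "norm_params", "output_size",
    "use_bias", "bias_initializer", "weights_initializer",
)


def resolve_fc_layers(combiner: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one fully populated dict per fully connected layer."""
    defaults = {key: combiner.get(key) for key in PER_LAYER_KEYS}
    layers: Optional[List[Dict[str, Any]]] = combiner.get("fc_layers")
    if layers is None:
        # Assumption: without fc_layers, num_fc_layers identical layers are built.
        layers = [{} for _ in range(combiner.get("num_fc_layers", 0))]
    # Per-layer values win; anything missing falls back to the standalone value.
    return [{**defaults, **layer} for layer in layers]


combiner = {
    "type": "concat",
    "dropout": 0.1,
    "activation": "relu",
    "output_size": 256,
    "fc_layers": [{"output_size": 128}, {"output_size": 64, "dropout": 0.0}],
}
resolved = resolve_fc_layers(combiner)
print([(layer["output_size"], layer["dropout"]) for layer in resolved])
# [(128, 0.1), (64, 0.0)]
```

The first layer inherits the standalone dropout of 0.1, while the second overrides it with its own value.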

Sequence Concat Combiner

graph LR
  I1 --> X1{Tile};
  IK --> X1;
  IN --> X1;
  IO --> X1;

  X1 --> SC1;
  X1 --> SCK;
  X1 --> SCN;

  SC1 --> R[Reduce];
  SCK --> R;
  SCN --> R;
  R --> ...;
  subgraph CONCAT["TENSOR.."]
    direction TB
    SC1["emb seq 1 | emb oth" ];
    SCK[...];
    SCN["emb seq n | emb oth"];
  end
  subgraph COMBINER..
  X1
  CONCAT
  R
  end
  subgraph SF[SEQUENCE FEATS..]
  direction TB
  I1["emb seq 1" ];
  IK[...];
  IN["emb seq n"];
  end
  subgraph OF[OTHER FEATS..]
  direction TB
  IO["emb oth"]
  end

The sequence_concat combiner assumes at least one output from encoders is a tensor of size b x s x h where b is the batch size, s is the length of the sequence and h is the hidden dimension. A sequence-like (sequence, text or time series) input feature can be specified with the main_sequence_feature parameter, which takes the name of a sequence-like input feature as its value. If no main_sequence_feature is specified, the combiner will look through all the features in the order they are defined in the configuration and will look for a feature with a rank 3 tensor output (sequence, text or time series). If it cannot find one it will raise an exception, otherwise the output of that feature will be used for concatenating the other features along the sequence s dimension.

If there are other input features with a rank 3 output tensor, the combiner will concatenate them alongside the s dimension, which means that all of them must have an identical s dimension. Otherwise, a dimension mismatch error will be thrown during training when a datapoint with two sequential features of different lengths is provided.

Other features that have a rank 2 b x h tensor output will be replicated s times, once per sequence step, and concatenated to the sequence features. The final output is a b x s x h' tensor where h' is the size of the concatenation of the h dimensions of all input features.
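The tiling and concatenation described above can be sketched with nested lists. This is a simplified illustration, not Ludwig's implementation: the helper name `sequence_concat` is hypothetical, and the optional reduction step is omitted.

```python
from typing import List

Seq = List[List[List[float]]]  # b x s x h
Mat = List[List[float]]        # b x h


def sequence_concat(sequences: List[Seq], others: List[Mat]) -> Seq:
    """Tile rank 2 inputs s times and join everything along the hidden dimension."""
    batch_size = len(sequences[0])
    out: Seq = []
    for r in range(batch_size):
        lengths = {len(seq[r]) for seq in sequences}
        if len(lengths) != 1:
            raise ValueError(f"sequence lengths differ: {sorted(lengths)}")
        s = lengths.pop()
        # The same rank 2 vector is repeated at every one of the s steps.
        tiled = [v for o in others for v in o[r]]
        out.append([
            [v for seq in sequences for v in seq[r][t]] + tiled
            for t in range(s)
        ])
    return out


main_seq = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]  # b = 1, s = 3, h = 2
category = [[9.0]]                                  # b = 1, h = 1
combined = sequence_concat([main_seq], [category])
print(len(combined), len(combined[0]), len(combined[0][0]))  # 1 3 3
```

Each of the three steps ends with the tiled value 9.0, so the output has h' = 2 + 1 = 3. Passing two sequences of different lengths raises a ValueError, mirroring the dimension mismatch described above.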

combiner:
    type: sequence_concat
    main_sequence_feature: null
    reduce_output: null

Parameters:

main_sequence_feature (default: null) : Name of a sequence, text, or time series feature to concatenate the outputs of the other features to. If no main_sequence_feature is specified, the combiner will look through all the features in the order they are defined in the configuration and will look for a feature with a rank 3 tensor output (sequence, text or time series). If it cannot find one it will raise an exception, otherwise the output of that feature will be used for concatenating the other features along the sequence s dimension. If there are other input features with a rank 3 output tensor, the combiner will concatenate them alongside the s dimension. All sequence-like input features must have identical s dimension, otherwise an error will be thrown.
reduce_output (default: null): Strategy to use to aggregate the output along the sequence dimension. Options: last, sum, mean, avg, max, concat, attention, none, None, null.

Sequence Combiner

graph LR
  I1 --> X1{Tile};
  IK --> X1;
  IN --> X1;
  IO --> X1;

  X1 --> SC1;
  X1 --> SCK;
  X1 --> SCN;

  SC1 --> R["Sequence Encoder"];
  SCK --> R;
  SCN --> R;
  R --> ...;
  subgraph CONCAT["TENSOR.."]
    direction TB
    SC1["emb seq 1 | emb oth" ];
    SCK[...];
    SCN["emb seq n | emb oth"];
  end
  subgraph COMBINER..
  X1
  CONCAT
  R
  end
  subgraph SF[SEQUENCE FEATS..]
  direction TB
  I1["emb seq 1" ];
  IK[...];
  IN["emb seq n"];
  end
  subgraph OF[OTHER FEATS..]
  direction TB
  IO["emb oth"]
  end

The sequence combiner stacks a sequence concat combiner with a sequence encoder. All the considerations about input tensor ranks described for the sequence concat combiner also apply here. The main difference is that this combiner uses the b x s x h' output of the sequence concat combiner, where b is the batch size, s is the sequence length and h' is the sum of the hidden dimensions of all input features, as input for any of the sequence encoders described in the sequence features encoders section. Refer to that section for more detailed information about the sequence encoders and their parameters. All considerations on the shape of the outputs for the sequence encoders also apply to the sequence combiner.

combiner:
    type: sequence
    main_sequence_feature: null
    reduce_output: null
    encoder:
        type: parallel_cnn
        skip: false
        dropout: 0.0
        activation: relu
        max_sequence_length: null
        representation: dense
        vocab: null
        use_bias: true
        bias_initializer: zeros
        weights_initializer: xavier_uniform
        should_embed: true
        embedding_size: 256
        embeddings_on_cpu: false
        embeddings_trainable: true
        pretrained_embeddings: null
        reduce_output: sum
        num_conv_layers: null
        conv_layers: null
        num_filters: 256
        filter_size: 3
        pool_function: max
        pool_size: null
        output_size: 256
        norm: null
        norm_params: null
        num_fc_layers: null
        fc_layers: null

Parameters:

main_sequence_feature (default: null) : Name of a sequence, text, or time series feature to concatenate the outputs of the other features to. If no main_sequence_feature is specified, the combiner will look through all the features in the order they are defined in the configuration and will look for a feature with a rank 3 tensor output (sequence, text or time series). If it cannot find one it will raise an exception, otherwise the output of that feature will be used for concatenating the other features along the sequence s dimension. If there are other input features with a rank 3 output tensor, the combiner will concatenate them alongside the s dimension. All sequence-like input features must have identical s dimension, otherwise an error will be thrown.
reduce_output (default: null): Strategy to use to aggregate the output along the sequence dimension. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
encoder (default: {"type": "parallel_cnn"}): Encoder to apply to main_sequence_feature. The encoder must produce a tensor of size [batch_size, sequence_length, hidden_size].

TabNet Combiner

graph LR
  I1[Encoder Output 1] --> C[TabNet];
  IK[...] --> C;
  IN[Encoder Output N] --> C;
  C --> ...;

The tabnet combiner implements the TabNet model, which uses attention and sparsity to achieve high performance on tabular data. It assumes all outputs from encoders are tensors of size b x h where b is the batch size and h is the hidden dimension, which can be different for each input. If the input tensors have a different shape, it automatically flattens them. It returns the final b x h' tensor where h' is the user-specified output size.

combiner:
    type: tabnet
    size: 32
    dropout: 0.05
    output_size: 128
    num_steps: 3
    num_total_blocks: 4
    num_shared_blocks: 2
    relaxation_factor: 1.5
    bn_epsilon: 0.001
    bn_momentum: 0.05
    bn_virtual_bs: 1024
    sparsity: 0.0001
    entmax_mode: sparsemax
    entmax_alpha: 1.5

Parameters:

size (default: 32) : Size of the hidden layers. N_a in (Arik and Pfister, 2019).
dropout (default: 0.05) : Dropout rate for the transformer block.
output_size (default: 128) : Output size of a fully connected layer. N_d in (Arik and Pfister, 2019).
num_steps (default: 3): Number of steps / repetitions of the attentive transformer and feature transformer computations. N_steps in (Arik and Pfister, 2019).
num_total_blocks (default: 4): Total number of feature transformer blocks at each step.
num_shared_blocks (default: 2): Number of shared feature transformer blocks across the steps.
relaxation_factor (default: 1.5): Factor that influences how many times a feature should be used across the steps of computation. A value of 1 implies that each feature should be used once; a higher value allows for multiple usages. gamma in (Arik and Pfister, 2019).
bn_epsilon (default: 0.001): Epsilon to be added to the batch norm denominator.
bn_momentum (default: 0.05): Momentum of the batch norm. 1 - m_B from the TabNet paper.
bn_virtual_bs (default: 1024): Size of the virtual batch used by ghost batch norm. If null, regular batch norm is used instead. B_v from the TabNet paper. See Ghost Batch Normalization for details.
sparsity (default: 0.0001): Multiplier of the sparsity inducing loss. lambda_sparse in (Arik and Pfister, 2019).
entmax_mode (default: sparsemax): Entmax is a sparse family of probability mappings that generalizes softmax and sparsemax. entmax_mode controls the sparsity. Options: entmax15, sparsemax, constant, adaptive.
entmax_alpha (default: 1.5): Must be a number between 1.0 and 2.0. If entmax_mode is adaptive, entmax_alpha is used as the initial value for the learnable parameter. 1 corresponds to softmax, 2 is sparsemax.
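To make the two ends of the entmax family and the role of relaxation_factor more concrete, here is a small sketch in plain Python. It follows the standard definitions of softmax and sparsemax, and the prior scale update P[i] = P[i-1] * (gamma - M[i]) from the TabNet paper. It is not Ludwig's implementation; the function names and the sample logits are hypothetical.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Dense mapping: every entry receives a strictly positive weight."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(z: List[float]) -> List[float]:
    """Euclidean projection of z onto the probability simplex."""
    cumulative = 0.0
    tau = 0.0
    for j, value in enumerate(sorted(z, reverse=True), start=1):
        cumulative += value
        # The support grows while 1 + j * z_(j) exceeds the sum of the j largest.
        if 1.0 + j * value > cumulative:
            tau = (cumulative - 1.0) / j
    return [max(v - tau, 0.0) for v in z]


def update_prior(prior: List[float], mask: List[float], gamma: float) -> List[float]:
    """Prior scale update: features used in this step are discounted in later steps."""
    return [p * (gamma - m) for p, m in zip(prior, mask)]


logits = [1.0, 0.8, -1.0]
print(softmax(logits))   # all three features keep some weight
mask = sparsemax(logits)
print(mask)              # the third feature gets exactly zero weight
prior = update_prior([1.0, 1.0, 1.0], mask, gamma=1.5)
print(prior)             # heavily used features end up with a smaller prior
```

With gamma set to 1, a feature that received the full mask weight in one step would get a prior of 0 and could not be selected again, which is the "used once" behavior described for relaxation_factor.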

Transformer Combiner

graph LR
  I1[Encoder Output 1] --> C["Transformer Stack"];
  IK[...] --> C;
  IN[Encoder Output N] --> C;
  C --> R[Reduce];
  R --> FC[Fully Connected Layers];
  FC --> ...;
  subgraph COMBINER..
  C
  R
  FC
  end

The transformer combiner combines input features using a stack of Transformer blocks (from Attention Is All You Need). It assumes all outputs from encoders are tensors of size b x h where b is the batch size and h is the hidden dimension, which can be different for each input. If the input tensors have a different shape, it automatically flattens them. It then projects each input tensor to the same hidden / embedding size and encodes them with a stack of Transformer layers. Finally, the transformer combiner applies a reduction to the outputs of the Transformer stack, followed by optional fully connected layers. The output is a b x h' tensor where h' is the size of the last fully connected layer or the hidden / embedding size, or a b x n x h' where n is the number of input features and h' is the hidden / embedding size if no reduction is applied.
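The effect of the reduction on the output shape can be sketched as follows. This only illustrates collapsing the feature axis of a b x n x h tensor; it is not Ludwig's implementation, the helper `reduce_features` is hypothetical, and it covers only the sum, mean and max strategies plus the no-reduction case.

```python
from typing import List, Optional, Union

Tensor3 = List[List[List[float]]]  # b x n x h
Tensor2 = List[List[float]]        # b x h


def reduce_features(x: Tensor3, mode: Optional[str]) -> Union[Tensor3, Tensor2]:
    """Collapse the feature axis n of a b x n x h tensor, or keep it when mode is None."""
    if mode is None:
        return x
    reducers = {
        "sum": sum,
        "mean": lambda col: sum(col) / len(col),
        "max": max,
    }
    if mode not in reducers:
        raise ValueError(f"mode not covered by this sketch: {mode}")
    reduce_fn = reducers[mode]
    # zip(*sample) walks the h positions, yielding the n values at each one.
    return [[reduce_fn(col) for col in zip(*sample)] for sample in x]


# b = 1, n = 2 input features, h = 3 after projection to the shared hidden size.
transformer_out = [[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]]
print(reduce_features(transformer_out, "mean"))  # [[2.0, 3.0, 4.0]]
print(reduce_features(transformer_out, None))    # shape stays b x n x h
```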

Resources to learn more about transformers:

CS480/680 Lecture 19: Attention and Transformer Networks (VIDEO)
Attention is all you need - Attentional Neural Network Models Masterclass (VIDEO)
Illustrated: Self-Attention (Colab notebook)

combiner:
    type: transformer
    dropout: 0.1
    num_fc_layers: 0
    output_size: 256
    norm: null
    fc_dropout: 0.0
    transformer_output_size: 256
    hidden_size: 256
    num_layers: 1
    num_heads: 8
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null
    fc_activation: relu
    fc_residual: false
    reduce_output: mean

Parameters:

dropout (default: 0.1) : Dropout rate for the transformer block.
num_fc_layers (default: 0) : The number of stacked fully connected layers (only applies if reduce_output is not null).
output_size (default: 256) : Output size of a fully connected layer.
norm (default: null) : Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
transformer_output_size (default: 256): Size of the fully connected layer after self attention in the transformer block. This is usually the same as hidden_size and embedding_size.
hidden_size (default: 256): The number of hidden units of the TransformerStack as well as the dimension that each incoming input feature is projected to before feeding to the TransformerStack.
num_layers (default: 1): The number of transformer layers.
num_heads (default: 8): Number of heads of the self attention in the transformer block.
use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.
fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
fc_residual (default: false): Whether to add a residual connection to each fully connected layer block. Requires all fully connected layers to have the same output_size. Options: true, false.
reduce_output (default: mean): Strategy to use to aggregate the output of the transformer. Options: last, sum, mean, avg, max, concat, attention, none, None, null.

TabTransformer Combiner

graph LR
  I1[Cat Emb 1] --> T1["Concat"];
  IK[...] --> T1;
  IN[Cat Emb N] --> T1;
  N1[Number ...] --> T4["Concat"];
  B1[Binary ...] --> T4;
  T1 --> T2["Transformer"];
  T2 --> T3["Reduce"];
  T3 --> T4;
  T4 --> T5["FC Layers"];
  T5 --> ...;
  subgraph COMBINER..
  CAT
  T4
  T5
  end
  subgraph ENCODER OUT..
  I1
  IK
  IN
  N1
  B1
  end
  subgraph CAT["CATEGORY PIPELINE.."]
  direction TB
  T1
  T2
  T3
  end

The tabtransformer combiner combines input features in the following sequence of operations. The combiner projects all encoder outputs except binary and number features into an embedding space. These features are concatenated as if they were a sequence and passed through a transformer. After the transformer, the number and binary features (which are of size 1) are concatenated with each other and then with the output of the transformer, and the result is passed to a stack of fully connected layers (from TabTransformer: Tabular Data Modeling Using Contextual Embeddings). It assumes all outputs from encoders are tensors of size b x h where b is the batch size and h is the hidden dimension, which can be different for each input. If the input tensors have a different shape, it automatically flattens them. It then projects each input tensor to the same hidden / embedding size and encodes them with a stack of Transformer layers. Finally, the combiner applies a reduction to the outputs of the Transformer stack, followed by the above concatenation and optional fully connected layers. The output is a b x h' tensor where h' is the size of the last fully connected layer or the hidden / embedding size, or a b x n x h' where n is the number of input features and h' is the hidden / embedding size if no reduction is applied.

combiner:
    type: tabtransformer
    dropout: 0.1
    num_fc_layers: 0
    output_size: 256
    norm: null
    fc_dropout: 0.0
    embed_input_feature_name: null
    transformer_output_size: 256
    hidden_size: 256
    num_layers: 1
    num_heads: 8
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null
    fc_activation: relu
    fc_residual: false
    reduce_output: concat

Parameters:

dropout (default: 0.1) : Dropout rate for the transformer block.
num_fc_layers (default: 0) : The number of stacked fully connected layers (only applies if reduce_output is not null).
output_size (default: 256) : Output size of a fully connected layer.
norm (default: null) : Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
embed_input_feature_name (default: null) : This value controls the size of the embeddings. Valid values are add, which uses the hidden_size value, or an integer that sets the embedding size explicitly. An integer value must be smaller than hidden_size.
transformer_output_size (default: 256): Size of the fully connected layer after self attention in the transformer block. This is usually the same as hidden_size and embedding_size.
hidden_size (default: 256): The number of hidden units of the TransformerStack as well as the dimension that each incoming input feature is projected to before feeding to the TransformerStack.
num_layers (default: 1): The number of transformer layers.
num_heads (default: 8): Number of heads of the self attention in the transformer block.
use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.
fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
fc_residual (default: false): Whether to add a residual connection to each fully connected layer block. Requires all fully connected layers to have the same output_size. Options: true, false.
reduce_output (default: concat): Strategy to use to aggregate the output of the transformer. Options: last, sum, mean, avg, max, concat, attention, none, None, null.

Comparator Combiner

graph LR
  I1[Entity 1 Embed 1] --> C1[Concat];
  IK[...] --> C1;
  IN[Entity 1 Embed N] --> C1;
  C1 --> FC1[FC Layers];
  FC1 --> COMP[Compare];

  I2[Entity 2 Embed 1] --> C2[Concat];
  IK2[...] --> C2;
  IN2[Entity 2 Embed N] --> C2;
  C2 --> FC2[FC Layers];
  FC2 --> COMP;

  COMP --> ...;

  subgraph ENTITY1["ENTITY 1.."]
  I1
  IK
  IN
  end

  subgraph ENTITY2["ENTITY 2.."]
  I2
  IK2
  IN2
  end

  subgraph COMBINER..
  C1
  FC1
  C2
  FC2
  COMP
  end

The comparator combiner compares the hidden representation of two entities defined by lists of features. It assumes all outputs from encoders are tensors of size b x h where b is the batch size and h is the hidden dimension, which can be different for each input. If the input tensors have a different shape, it automatically flattens them. It then concatenates the representations of each entity and projects them both to vectors of size output_size. Finally, it compares the two entity representations by dot product, element-wise multiplication, absolute difference and bilinear product. It returns the final b x h' tensor where h' is the size of the concatenation of the four comparisons.
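The four comparisons can be sketched for a single pair of entity vectors. This is an illustration under stated assumptions rather than Ludwig's implementation: the helper `compare` is hypothetical, the order of concatenation is assumed, the dot product and bilinear product are assumed to contribute one value each, and the bilinear weight matrix (learned in practice) is fixed to the identity here.

```python
from typing import List


def compare(e1: List[float], e2: List[float], bilinear_w: List[List[float]]) -> List[float]:
    """Concatenate four comparisons of two entity vectors of equal size h."""
    if len(e1) != len(e2):
        raise ValueError("both entities must be projected to the same size")
    dot = sum(a * b for a, b in zip(e1, e2))         # 1 value
    product = [a * b for a, b in zip(e1, e2)]        # h values
    abs_diff = [abs(a - b) for a, b in zip(e1, e2)]  # h values
    # Bilinear product e1^T W e2 with an h x h weight matrix W.
    bilinear = sum(
        e1[i] * bilinear_w[i][j] * e2[j]
        for i in range(len(e1))
        for j in range(len(e2))
    )                                                # 1 value
    return [dot] + product + abs_diff + [bilinear]


entity_1 = [1.0, 2.0]
entity_2 = [3.0, 5.0]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weights for the example
features = compare(entity_1, entity_2, identity)
print(features)  # [13.0, 3.0, 10.0, 2.0, 3.0, 13.0]
```

Under these assumptions the concatenated vector has 2h + 2 entries, which is 6 for h = 2.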

combiner:
    type: comparator
    entity_1:
    - feature_1
    - feature_2
    entity_2:
    - feature_3
    dropout: 0.0
    num_fc_layers: 1
    output_size: 256
    norm: null
    activation: relu
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null

Parameters:

entity_1 (default: null) : The list of input feature names [feature_1, feature_2, ...] constituting the first entity to compare. Required.
entity_2 (default: null) : The list of input feature names [feature_1, feature_2, ...] constituting the second entity to compare. Required.
dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
num_fc_layers (default: 1) : Number of stacked fully connected layers to apply. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
output_size (default: 256) : Output size of a fully connected layer.
norm (default: null) : Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.
fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.

Common Parameters

These parameters are used across multiple combiners (and some encoders / decoders) in similar ways.

Normalization

Normalization applied at the beginning of the fully-connected stack. If a norm is not already specified for the fc_layers, this is the default norm that will be used for each layer. One of:

null: no normalization
batch: batch normalization
layer: layer normalization
ghost: ghost batch normalization

Batch Normalization

Applies Batch Normalization as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. See PyTorch documentation on batch normalization for more details.

norm: batch
norm_params:
  eps: 0.001
  momentum: 0.1
  affine: true
  track_running_stats: true

Parameters:

eps (default: 0.001): Epsilon to be added to the batch norm denominator.
momentum (default: 0.1): The value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average).
affine (default: true): When set to true, this module has learnable affine parameters.
track_running_stats (default: true): When set to true, this module tracks the running mean and variance. When set to false, this module does not track such statistics and initializes the statistics buffers running_mean and running_var as null. When these buffers are null, this module always uses batch statistics in both training and eval modes.
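How eps and momentum enter the computation can be sketched for a single scalar feature. The update rule follows the convention documented for PyTorch batch normalization, where the new running value is (1 - momentum) times the old value plus momentum times the observed value. The helper `batch_norm_train_step` is hypothetical, and the learnable affine parameters are omitted.

```python
from typing import List, Tuple


def batch_norm_train_step(
    batch: List[float],
    running_mean: float,
    running_var: float,
    eps: float = 0.001,
    momentum: float = 0.1,
) -> Tuple[List[float], float, float]:
    """Normalize one scalar feature over a batch and update the running statistics."""
    n = len(batch)
    mean = sum(batch) / n
    biased_var = sum((x - mean) ** 2 for x in batch) / n
    # eps keeps the denominator away from zero when the variance is tiny.
    normalized = [(x - mean) / (biased_var + eps) ** 0.5 for x in batch]
    # The running variance stores the unbiased estimate (divides by n - 1).
    unbiased_var = biased_var * n / (n - 1) if n > 1 else biased_var
    running_mean = (1 - momentum) * running_mean + momentum * mean
    running_var = (1 - momentum) * running_var + momentum * unbiased_var
    return normalized, running_mean, running_var


normalized, new_mean, new_var = batch_norm_train_step([1.0, 2.0, 3.0, 4.0], 0.0, 1.0)
print(normalized)
print(new_mean, new_var)  # running statistics move 10% of the way to the batch values
```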

Layer Normalization

Applies Layer Normalization over a mini-batch of inputs as described in the paper Layer Normalization. See PyTorch documentation on layer normalization for more details.

norm: layer
norm_params:
  eps: 0.00001
  elementwise_affine: true

Parameters:

eps (default: 0.00001): A value added to the denominator for numerical stability.
elementwise_affine (default: true): When set to true, this module has learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases).
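In contrast to batch normalization, layer normalization computes its statistics over the features of each sample rather than over the batch. A minimal sketch, assuming no learnable affine parameters and using a hypothetical helper `layer_norm`:

```python
from typing import List


def layer_norm(sample: List[float], eps: float = 0.00001) -> List[float]:
    """Normalize one sample across its own features."""
    mean = sum(sample) / len(sample)
    var = sum((x - mean) ** 2 for x in sample) / len(sample)
    return [(x - mean) / (var + eps) ** 0.5 for x in sample]


batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
normalized = [layer_norm(row) for row in batch]
print(normalized)  # both rows normalize to nearly the same values
```

Because each row is normalized independently, the result does not depend on the batch size or on the other samples in the batch.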

Ghost Batch Normalization

Ghost Batch Norm is a technique designed to address the "generalization gap" whereby the training process breaks down with very large batch sizes. If you are using a large batch size (typically in the thousands) to maximize GPU utilization, but the model is not converging well, enabling ghost batch norm can be a useful technique to improve convergence.

When using ghost batch norm, you specify a virtual_batch_size (default 128) representing the "ideal" batch size to train with (ignoring throughput or GPU utilization). The ghost batch norm will then subdivide each batch into subbatches of size virtual_batch_size and apply batch normalization to each.
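The subdivision step can be sketched for a single scalar feature. This is a simplified illustration, not Ludwig's implementation: the helper `ghost_batch_norm` is hypothetical, and running statistics and learnable parameters are left out.

```python
from typing import List


def ghost_batch_norm(batch: List[float], virtual_batch_size: int, eps: float = 0.001) -> List[float]:
    """Normalize each chunk of virtual_batch_size examples with its own statistics."""
    out: List[float] = []
    for start in range(0, len(batch), virtual_batch_size):
        chunk = batch[start:start + virtual_batch_size]
        mean = sum(chunk) / len(chunk)
        var = sum((x - mean) ** 2 for x in chunk) / len(chunk)
        out.extend((x - mean) / (var + eps) ** 0.5 for x in chunk)
    return out


# A batch of 8 scalars split into two virtual batches of 4.
batch = [1.0, 2.0, 3.0, 4.0, 101.0, 102.0, 103.0, 104.0]
normalized = ghost_batch_norm(batch, virtual_batch_size=4)
print(normalized[:4])
print(normalized[4:])  # same values: each sub-batch uses its own mean and variance
```

Setting virtual_batch_size to the full batch length makes the function behave like regular batch normalization over the whole batch.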

A notable downside to ghost batch norm is that it is more computationally expensive than traditional batch norm, so it is only recommended to use it when the batch size that maximizes throughput is significantly higher than the batch size that yields the best convergence (one or more orders of magnitude higher).

The approach was introduced in Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks and since popularized by its use in TabNet.

norm: ghost
norm_params:
  virtual_batch_size: 128
  epsilon: 0.001
  momentum: 0.05

Parameters:

virtual_batch_size (default: 128): Size of the virtual batch used by ghost batch norm. If null, regular batch norm is used instead. B_v from the TabNet paper.
epsilon (default: 0.001): Epsilon to be added to the batch norm denominator.
momentum (default: 0.05): Momentum of the batch norm. 1 - m_B from the TabNet paper.