Combiner

Combiners take the outputs of all input feature encoders and combine them before passing the combined representation to the output feature decoders.

You can specify which one to use in the combiner section of the configuration, and if you don't specify a combiner, the concat combiner will be used.

Combiner Types

Concat Combiner

graph LR
  I1[Encoder Output 1] --> C[Concat];
  IK[...] --> C;
  IN[Encoder Output N] --> C;
  C --> FC[Fully Connected Layers];
  FC --> ...;
  subgraph COMBINER..
  C
  FC
  end

The concat combiner assumes all outputs from encoders are tensors of size b x h where b is the batch size and h is the hidden dimension, which can be different for each input. If any input has more than 2 dimensions (a sequence or set feature, for example), set the flatten_inputs parameter to true. The combiner concatenates along the h dimension, and then (optionally) passes the concatenated tensor through a stack of fully connected layers. It returns the final b x h' tensor where h' is the size of the last fully connected layer, or the sum of the h sizes of all inputs if there are no fully connected layers. If there is only a single input feature and no fully connected layers are specified, the output of the input feature encoder is passed through the combiner unchanged.
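As a rough illustration of the shape arithmetic described above, the following minimal sketch uses nested Python lists in place of tensors. The function name concat_combiner and the sample values are hypothetical and not part of Ludwig's API, and the sketch only covers the case with no fully connected layers.

```python
# Illustrative sketch only: nested lists stand in for b x h tensors.
def concat_combiner(encoder_outputs):
    """Concatenate a list of b x h_i matrices along the hidden dimension."""
    batch_size = len(encoder_outputs[0])
    combined = []
    for row in range(batch_size):
        merged = []
        for output in encoder_outputs:
            merged.extend(output[row])  # append this input's h_i values
        combined.append(merged)
    return combined


# Two hypothetical encoder outputs with b = 2 and hidden sizes 3 and 2.
first_output = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
second_output = [[1.0, 2.0], [3.0, 4.0]]

combined = concat_combiner([first_output, second_output])
print(len(combined), len(combined[0]))  # batch size, then h' = 3 + 2
```

With no fully connected layers, h' is simply the sum of the input hidden sizes, here 5.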

combiner:
    type: concat
    dropout: 0.0
    num_fc_layers: 0
    output_size: 256
    norm: null
    activation: relu
    flatten_inputs: false
    residual: false
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null
    batch_ensemble: false
    num_ensemble_members: 4

Parameters:

  • dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • num_fc_layers (default: 0) : Number of stacked fully connected layers to apply. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
  • output_size (default: 256) : Output size of a fully connected layer.
  • norm (default: null) : Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
  • activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
  • flatten_inputs (default: false): Whether to flatten input tensors to a vector.
  • residual (default: false): Whether to add a residual connection to each fully connected layer block. Requires all fully connected layers to have the same output_size.
  • use_bias (default: true): Whether the layer uses a bias vector.
  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.

  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
  • batch_ensemble (default: false): Whether to use BatchEnsemble (TabM-style) for parameter-efficient ensembling. Adds per-member rank-1 scaling vectors to the output layer, providing ensemble-level performance at single-model cost.
  • num_ensemble_members (default: 4): Number of ensemble members when batch_ensemble is enabled.
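The fallback rule for fc_layers can be sketched as follows, with the configuration written as a Python dictionary. The helper resolve_fc_layers is hypothetical and only mimics the documented behavior for three of the per-layer keys; it is not Ludwig's internal implementation.

```python
# Combiner section expressed as a Python dict (values chosen for illustration).
combiner_config = {
    "type": "concat",
    "output_size": 256,
    "activation": "relu",
    "dropout": 0.0,
    "fc_layers": [
        {"output_size": 128, "dropout": 0.2},  # overrides two defaults
        {"output_size": 64},                   # overrides one default
    ],
}


def resolve_fc_layers(config):
    """Fill missing per-layer keys from the standalone default parameters."""
    keys = ("output_size", "activation", "dropout")
    defaults = {key: config[key] for key in keys}
    return [{**defaults, **layer} for layer in config["fc_layers"]]


layers = resolve_fc_layers(combiner_config)
for index, layer in enumerate(layers):
    print(index, layer)
```

The list has two entries, so two fully connected layers are stacked; the second layer inherits activation and dropout from the standalone parameters.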

Sequence Concat Combiner

graph LR
  I1 --> X1{Tile};
  IK --> X1;
  IN --> X1;
  IO --> X1;

  X1 --> SC1;
  X1 --> SCK;
  X1 --> SCN;

  SC1 --> R[Reduce];
  SCK --> R;
  SCN --> R;
  R --> ...;
  subgraph CONCAT["TENSOR.."]
    direction TB
    SC1["emb seq 1 | emb oth" ];
    SCK[...];
    SCN["emb seq n | emb oth"];
  end
  subgraph COMBINER..
  X1
  CONCAT
  R
  end
  subgraph SF[SEQUENCE FEATS..]
  direction TB
  I1["emb seq 1" ];
  IK[...];
  IN["emb seq n"];
  end
  subgraph OF[OTHER FEATS..]
  direction TB
  IO["emb oth"]
  end

The sequence_concat combiner assumes at least one output from encoders is a tensor of size b x s x h where b is the batch size, s is the length of the sequence and h is the hidden dimension. A sequence-like (sequence, text or time series) input feature can be specified with the main_sequence_feature parameter, which takes the name of a sequence-like input feature as its value. If no main_sequence_feature is specified, the combiner will look through all the features in the order they are defined in the configuration and will look for a feature with a rank 3 tensor output (sequence, text or time series). If it cannot find one it will raise an exception, otherwise the output of that feature will be used for concatenating the other features along the sequence s dimension.
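The lookup described above can be sketched in plain Python. The helper names tensor_rank and find_main_sequence_feature and the feature names are hypothetical; nested lists stand in for tensors, and the dictionary order stands in for the order of features in the configuration.

```python
def tensor_rank(tensor):
    """Count nesting depth of a non-empty nested list (stand-in for tensor rank)."""
    rank = 0
    while isinstance(tensor, list):
        rank += 1
        tensor = tensor[0]
    return rank


def find_main_sequence_feature(encoder_outputs, main_sequence_feature=None):
    """Return the configured feature, or the first feature with a rank 3 output."""
    if main_sequence_feature is not None:
        return main_sequence_feature
    for name, tensor in encoder_outputs.items():
        if tensor_rank(tensor) == 3:
            return name
    raise ValueError("No sequence-like input feature with a rank 3 output was found")


# Hypothetical encoder outputs, in configuration order.
encoder_outputs = {
    "age": [[0.5]],                 # rank 2: b x h
    "review": [[[0.1, 0.2]]],       # rank 3: b x s x h
    "title": [[[0.3, 0.4]]],        # rank 3: b x s x h
}

print(find_main_sequence_feature(encoder_outputs))
print(find_main_sequence_feature(encoder_outputs, main_sequence_feature="title"))
```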

If there are other input features with a rank 3 output tensor, the combiner will concatenate them alongside the s dimension, which means that all of them must have an identical s dimension. Otherwise, a dimension mismatch error will be thrown during training when a datapoint with two sequential features of different lengths is provided.

Other features that have a rank 2 b x h tensor output will be replicated s times along the sequence dimension and concatenated to each of the s steps. The final output is a b x s x h' tensor where h' is the size of the concatenation of the h dimensions of all input features.
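A minimal sketch of the tiling and concatenation, assuming one rank 3 sequence output and one rank 2 output, is shown below. Nested lists stand in for tensors, and the function name sequence_concat is hypothetical rather than Ludwig's implementation.

```python
def sequence_concat(sequence_output, other_output):
    """Combine a b x s x h1 tensor with a b x h2 tensor into b x s x (h1 + h2).

    The rank 2 row of each example is repeated at every one of the s steps.
    """
    combined = []
    for sequence, other_row in zip(sequence_output, other_output):
        combined.append([step + other_row for step in sequence])
    return combined


# b = 1, s = 3, h1 = 2 for the sequence feature; h2 = 1 for the other feature.
sequence_output = [[[1, 2], [3, 4], [5, 6]]]
other_output = [[9]]

combined = sequence_concat(sequence_output, other_output)
print(combined)
```

Each of the three steps now carries the same extra value from the rank 2 feature, so h' = 2 + 1 = 3.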

combiner:
    type: sequence_concat
    main_sequence_feature: null
    reduce_output: null

Parameters:

  • main_sequence_feature (default: null) : Name of a sequence, text, or time series feature to concatenate the outputs of the other features to. If no main_sequence_feature is specified, the combiner will look through all the features in the order they are defined in the configuration and will look for a feature with a rank 3 tensor output (sequence, text or time series). If it cannot find one it will raise an exception, otherwise the output of that feature will be used for concatenating the other features along the sequence s dimension. If there are other input features with a rank 3 output tensor, the combiner will concatenate them alongside the s dimension. All sequence-like input features must have identical s dimension, otherwise an error will be thrown.

  • reduce_output (default: null): Strategy to use to aggregate the embeddings of the items of the set. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
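To make some of the reduce_output options concrete, here is a small sketch that reduces a b x s x h nested list over the sequence dimension. It covers only last, sum, mean, max and null, ignores padding, and the function name reduce_sequence is hypothetical; the remaining options are not sketched here.

```python
def reduce_sequence(tensor, mode):
    """Reduce a b x s x h nested list over s; return it unchanged if mode is None."""
    if mode is None:
        return tensor
    reduced = []
    for sequence in tensor:
        columns = list(zip(*sequence))  # h tuples, each holding s values
        if mode == "last":
            reduced.append(list(sequence[-1]))
        elif mode == "sum":
            reduced.append([sum(column) for column in columns])
        elif mode == "mean":
            reduced.append([sum(column) / len(column) for column in columns])
        elif mode == "max":
            reduced.append([max(column) for column in columns])
        else:
            raise ValueError(f"Mode not covered by this sketch: {mode}")
    return reduced


# b = 1, s = 3, h = 2
combined = [[[1.0, 4.0], [2.0, 5.0], [3.0, 0.0]]]

for mode in ("last", "sum", "mean", "max", None):
    print(mode, reduce_sequence(combined, mode))
```

With the default null, the b x s x h' tensor is passed on unreduced; any of the other modes shown here yields a b x h' tensor.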

Sequence Combiner

graph LR
  I1 --> X1{Tile};
  IK --> X1;
  IN --> X1;
  IO --> X1;

  X1 --> SC1;
  X1 --> SCK;
  X1 --> SCN;

  SC1 --> R["Sequence Encoder"];
  SCK --> R;
  SCN --> R;
  R --> ...;
  subgraph CONCAT["TENSOR.."]
    direction TB
    SC1["emb seq 1 | emb oth" ];
    SCK[...];
    SCN["emb seq n | emb oth"];
  end
  subgraph COMBINER..
  X1
  CONCAT
  R
  end
  subgraph SF[SEQUENCE FEATS..]
  direction TB
  I1["emb seq 1" ];
  IK[...];
  IN["emb seq n"];
  end
  subgraph OF[OTHER FEATS..]
  direction TB
  IO["emb oth"]
  end

The sequence combiner stacks a sequence concat combiner with a sequence encoder. All the considerations about input tensor ranks described for the sequence concat combiner apply also in this case, but the main difference is that this combiner uses the b x s x h' output of the sequence concat combiner, where b is the batch size, s is the sequence length and h' is the sum of the hidden dimensions of all input features, as input for any of the sequence encoders described in the sequence features encoders section. Refer to that section for more detailed information about the sequence encoders and their parameters. All considerations on the shape of the outputs for the sequence encoders also apply to sequence combiner.
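As a hedged example, a configuration using this combiner could be written as a Python dictionary like the one below. The feature names review_text and rating are hypothetical, the feature type strings are assumptions about the configuration schema, and the helper check_main_sequence_feature is only an illustrative sanity check, not part of Ludwig.

```python
# Feature types assumed to be sequence-like (rank 3 encoder outputs).
SEQUENCE_LIKE_TYPES = {"sequence", "text", "timeseries"}

config = {
    "input_features": [
        {"name": "review_text", "type": "text"},   # hypothetical feature
        {"name": "rating", "type": "number"},      # hypothetical feature
    ],
    "combiner": {
        "type": "sequence",
        "main_sequence_feature": "review_text",
        "reduce_output": None,
        "encoder": {
            "type": "parallel_cnn",
            "num_filters": 256,
            "filter_size": 3,
            "reduce_output": "sum",
        },
    },
}


def check_main_sequence_feature(config):
    """Return True if main_sequence_feature names a sequence-like input feature."""
    target = config["combiner"]["main_sequence_feature"]
    for feature in config["input_features"]:
        if feature["name"] == target:
            return feature["type"] in SEQUENCE_LIKE_TYPES
    return False


print(check_main_sequence_feature(config))
```

The nested encoder section selects the sequence encoder that consumes the b x s x h' tensor produced by the concatenation stage.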

combiner:
    type: sequence
    main_sequence_feature: null
    reduce_output: null
    encoder:
        type: parallel_cnn
        skip: false
        adapter: null
        dropout: 0.0
        activation: relu
        max_sequence_length: null
        representation: dense
        vocab: null
        use_bias: true
        bias_initializer: zeros
        weights_initializer: xavier_uniform
        should_embed: true
        embedding_size: 256
        embeddings_on_cpu: false
        embeddings_trainable: true
        pretrained_embeddings: null
        reduce_output: sum
        num_conv_layers: null
        conv_layers: null
        num_filters: 256
        filter_size: 3
        pool_function: max
        pool_size: null
        output_size: 256
        norm: null
        norm_params: null
        num_fc_layers: null
        fc_layers: null

Parameters:

  • main_sequence_feature (default: null) : Name of a sequence, text, or time series feature to concatenate the outputs of the other features to. If no main_sequence_feature is specified, the combiner will look through all the features in the order they are defined in the configuration and will look for a feature with a rank 3 tensor output (sequence, text or time series). If it cannot find one it will raise an exception, otherwise the output of that feature will be used for concatenating the other features along the sequence s dimension. If there are other input features with a rank 3 output tensor, the combiner will concatenate them alongside the s dimension. All sequence-like input features must have identical s dimension, otherwise an error will be thrown.

  • reduce_output (default: null): Strategy to use to aggregate the embeddings of the items of the set. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.

  • encoder (default: null):

TabNet Combiner

graph LR
  I1[Encoder Output 1] --> C[TabNet];
  IK[...] --> C;
  IN[Encoder Output N] --> C;
  C --> ...;

The tabnet combiner implements the TabNet model, which uses attention and sparsity to achieve high performance on tabular data. It assumes all outputs from encoders are tensors of size b x h where b is the batch size and h is the hidden dimension, which can be different for each input. If the input tensors have a different shape, it automatically flattens them. It returns the final b x h' tensor where h' is the user-specified output size.

combiner:
    type: tabnet
    size: 32
    dropout: 0.05
    output_size: 128
    num_steps: 3
    num_total_blocks: 4
    num_shared_blocks: 2
    relaxation_factor: 1.5
    bn_epsilon: 0.001
    bn_momentum: 0.05
    bn_virtual_bs: 1024
    sparsity: 0.0001
    entmax_mode: sparsemax
    entmax_alpha: 1.5

Parameters:

  • size (default: 32) : Size of the hidden layers. N_a in (Arik and Pfister, 2019).
  • dropout (default: 0.05) : Dropout rate for the transformer block.
  • output_size (default: 128) : Output size of a fully connected layer. N_d in (Arik and Pfister, 2019).
  • num_steps (default: 3): Number of steps / repetitions of the attentive transformer and feature transformer computations. N_steps in (Arik and Pfister, 2019).
  • num_total_blocks (default: 4): Total number of feature transformer blocks at each step.
  • num_shared_blocks (default: 2): Number of shared feature transformer blocks across the steps.
  • relaxation_factor (default: 1.5): Factor that influences how many times a feature should be used across the steps of computation. A value of 1 implies that each feature should be used once; a higher value allows for multiple usages. gamma in (Arik and Pfister, 2019).
  • bn_epsilon (default: 0.001): Epsilon to be added to the batch norm denominator.
  • bn_momentum (default: 0.05): Momentum of the batch norm. 1 - m_B from the TabNet paper.
  • bn_virtual_bs (default: 1024): Size of the virtual batch size used by ghost batch norm. If null, regular batch norm is used instead. B_v from the TabNet paper. See Ghost Batch Normalization for details.
  • sparsity (default: 0.0001): Multiplier of the sparsity inducing loss. lambda_sparse in (Arik and Pfister, 2019).
  • entmax_mode (default: sparsemax): Entmax is a family of sparse probability mappings that generalizes softmax and sparsemax. entmax_mode controls the sparsity. Options: entmax15, sparsemax, constant, adaptive.
  • entmax_alpha (default: 1.5): Must be a number between 1.0 and 2.0. If entmax_mode is adaptive, entmax_alpha is used as the initial value for the learnable parameter. 1 corresponds to softmax, 2 is sparsemax.
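The two ends of the entmax family mentioned above, softmax (alpha = 1) and sparsemax (alpha = 2), can be compared with a short standard-library sketch. This is an illustration of the mappings themselves, not the code the tabnet combiner runs, and the input scores are made up.

```python
import math


def softmax(scores):
    """Dense mapping: every entry receives a strictly positive probability."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def sparsemax(scores):
    """Euclidean projection onto the probability simplex; can output exact zeros."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    threshold = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        # The support condition holds for a prefix of the sorted scores,
        # so the last k that satisfies it determines the threshold.
        if 1.0 + k * value > cumulative:
            threshold = (cumulative - 1.0) / k
    return [max(score - threshold, 0.0) for score in scores]


scores = [1.0, 0.5, -1.0]
print(softmax(scores))    # all entries positive
print(sparsemax(scores))  # the lowest score is mapped to exactly zero
```

Both outputs sum to 1, but only sparsemax drops the lowest-scoring entry entirely, which is the kind of sparsity that entmax_mode and entmax_alpha control.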

Transformer Combiner

graph LR
  I1[Encoder Output 1] --> C["Transformer Stack"];
  IK[...] --> C;
  IN[Encoder Output N] --> C;
  C --> R[Reduce];
  R --> FC[Fully Connected Layers];
  FC --> ...;
  subgraph COMBINER..
  C
  R
  FC
  end

The transformer combiner combines input features using a stack of Transformer blocks (from Attention Is All You Need). It assumes all outputs from encoders are tensors of size b x h where b is the batch size and h is the hidden dimension, which can be different for each input. If the input tensors have a different shape, it automatically flattens them. It then projects each input tensor to the same hidden / embedding size and encodes them with a stack of Transformer layers. Finally, the transformer combiner applies a reduction to the outputs of the Transformer stack, followed by optional fully connected layers. The output is a b x h' tensor where h' is the size of the last fully connected layer or the hidden / embedding size, or a b x n x h' where n is the number of input features and h' is the hidden / embedding size if no reduction is applied.
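The output shapes described above can be summarized with a small helper. This is a sketch of the shape arithmetic only; the function name and the fc_output_sizes argument are hypothetical, and only a subset of the reduce_output options is covered.

```python
def transformer_combiner_output_shape(
    batch_size, num_features, hidden_size, reduce_output="mean", fc_output_sizes=()
):
    """Return the documented output shape for a few reduce_output settings."""
    if reduce_output is None:
        # No reduction: one hidden vector per input feature.
        return (batch_size, num_features, hidden_size)
    if reduce_output not in ("last", "sum", "mean", "max"):
        raise ValueError(f"Mode not covered by this sketch: {reduce_output}")
    # Reduction collapses the n features; optional FC layers set the final width.
    width = fc_output_sizes[-1] if fc_output_sizes else hidden_size
    return (batch_size, width)


print(transformer_combiner_output_shape(32, 5, 256))
print(transformer_combiner_output_shape(32, 5, 256, fc_output_sizes=(128, 64)))
print(transformer_combiner_output_shape(32, 5, 256, reduce_output=None))
```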

combiner:
    type: transformer
    dropout: 0.1
    num_fc_layers: 0
    output_size: 256
    norm: null
    fc_dropout: 0.0
    transformer_output_size: 256
    hidden_size: 256
    num_layers: 1
    num_heads: 8
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null
    fc_activation: relu
    fc_residual: false
    reduce_output: mean

Parameters:

  • dropout (default: 0.1) : Dropout rate for the transformer block.
  • num_fc_layers (default: 0) : The number of stacked fully connected layers (only applies if reduce_output is not null).
  • output_size (default: 256) : Output size of a fully connected layer.
  • norm (default: null) : Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
  • fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • transformer_output_size (default: 256): Size of the fully connected layer after self attention in the transformer block. This is usually the same as hidden_size and embedding_size.
  • hidden_size (default: 256): The number of hidden units of the TransformerStack as well as the dimension that each incoming input feature is projected to before feeding to the TransformerStack.
  • num_layers (default: 1): The number of transformer layers.
  • num_heads (default: 8): Number of heads of the self attention in the transformer block.
  • use_bias (default: true): Whether the layer uses a bias vector.
  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.

  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
  • fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
  • fc_residual (default: false): Whether to add a residual connection to each fully connected layer block. Requires all fully connected layers to have the same output_size.
  • reduce_output (default: mean): Strategy to use to aggregate the output of the transformer. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.

TabTransformer Combiner

graph LR
  I1[Cat Emb 1] --> T1["Concat"];
  IK[...] --> T1;
  IN[Cat Emb N] --> T1;
  N1[Number ...] --> T4["Concat"];
  B1[Binary ...] --> T4;
  T1 --> T2["Transformer"];
  T2 --> T3["Reduce"];
  T3 --> T4;
  T4 --> T5["FC Layers"];
  T5 --> ...;
  subgraph COMBINER..
  CAT
  T4
  T5
  end
  subgraph ENCODER OUT..
  I1
  IK
  IN
  N1
  B1
  end
  subgraph CAT["CATEGORY PIPELINE.."]
  direction TB
  T1
  T2
  T3
  end

The tabtransformer combiner combines input features in the following sequence of operations. The combiner projects all encoder outputs except binary and number features into an embedding space. These features are concatenated as if they were a sequence and passed through a transformer. After the transformer, the number and binary features (which are of size 1) are concatenated with the output of the transformer, and the result is passed to a stack of fully connected layers (from TabTransformer: Tabular Data Modeling Using Contextual Embeddings). It assumes all outputs from encoders are tensors of size b x h where b is the batch size and h is the hidden dimension, which can be different for each input. If the input tensors have a different shape, it automatically flattens them. It then projects each input tensor to the same hidden / embedding size and encodes them with a stack of Transformer layers. Finally, the combiner applies a reduction to the outputs of the Transformer stack, followed by the above concatenation and optional fully connected layers. The output is a b x h' tensor where h' is the size of the last fully connected layer or the hidden / embedding size, or a b x n x h' where n is the number of input features and h' is the hidden / embedding size if no reduction is applied.
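The routing of features described above can be sketched as a simple partition by feature type. The function name split_tabtransformer_inputs and the feature names are hypothetical; the sketch only shows which encoder outputs go through the transformer and which are concatenated afterwards.

```python
def split_tabtransformer_inputs(feature_types):
    """Partition feature names into transformer inputs and pass-through inputs.

    Binary and number features skip the transformer and are concatenated
    with its reduced output; all other features are embedded and attended over.
    """
    passthrough_types = ("binary", "number")
    to_transformer = [
        name for name, kind in feature_types.items() if kind not in passthrough_types
    ]
    passthrough = [
        name for name, kind in feature_types.items() if kind in passthrough_types
    ]
    return to_transformer, passthrough


# Hypothetical input features and their types.
feature_types = {
    "country": "category",
    "age": "number",
    "device": "category",
    "is_member": "binary",
}

to_transformer, passthrough = split_tabtransformer_inputs(feature_types)
print(to_transformer)
print(passthrough)
```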

combiner:
    type: tabtransformer
    dropout: 0.1
    num_fc_layers: 0
    output_size: 256
    norm: null
    fc_dropout: 0.0
    embed_input_feature_name: null
    transformer_output_size: 256
    hidden_size: 256
    num_layers: 1
    num_heads: 8
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null
    fc_activation: relu
    fc_residual: false
    reduce_output: concat

Parameters:

  • dropout (default: 0.1) : Dropout rate for the transformer block.
  • num_fc_layers (default: 0) : The number of stacked fully connected layers (only applies if reduce_output is not null).
  • output_size (default: 256) : Output size of a fully connected layer.
  • norm (default: null) : Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
  • fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • embed_input_feature_name (default: null) : This value controls the size of the embeddings. Valid values are add, which uses the hidden_size value, or an integer that sets a specific size. In the case of an integer value, it must be smaller than hidden_size.
  • transformer_output_size (default: 256): Size of the fully connected layer after self attention in the transformer block. This is usually the same as hidden_size and embedding_size.
  • hidden_size (default: 256): The number of hidden units of the TransformerStack as well as the dimension that each incoming input feature is projected to before feeding to the TransformerStack.
  • num_layers (default: 1): The number of transformer layers.
  • num_heads (default: 8): Number of heads of the self attention in the transformer block.
  • use_bias (default: true): Whether the layer uses a bias vector.
  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.

  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
  • fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
  • fc_residual (default: false): Whether to add a residual connection to each fully connected layer block. Requires all fully connected layers to have the same output_size.
  • reduce_output (default: concat): Strategy to use to aggregate the output of the transformer. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.

Comparator Combiner

graph LR
  I1[Entity 1 Embed 1] --> C1[Concat];
  IK[...] --> C1;
  IN[Entity 1 Embed N] --> C1;
  C1 --> FC1[FC Layers];
  FC1 --> COMP[Compare];

  I2[Entity 2 Embed 1] --> C2[Concat];
  IK2[...] --> C2;
  IN2[Entity 2 Embed N] --> C2;
  C2 --> FC2[FC Layers];
  FC2 --> COMP;

  COMP --> ...;

  subgraph ENTITY1["ENTITY 1.."]
  I1
  IK
  IN
  end

  subgraph ENTITY2["ENTITY 2.."]
  I2
  IK2
  IN2
  end

  subgraph COMBINER..
  C1
  FC1
  C2
  FC2
  COMP
  end

The comparator combiner compares the hidden representation of two entities defined by lists of features. It assumes all outputs from encoders are tensors of size b x h where b is the batch size and h is the hidden dimension, which can be different for each input. If the input tensors have a different shape, it automatically flattens them. It then concatenates the representations of each entity and projects them both to vectors of size output_size. Finally, it compares the two entity representations by dot product, element-wise multiplication, absolute difference and bilinear product. It returns the final b x h' tensor where h' is the size of the concatenation of the four comparisons.
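The four comparisons can be sketched for a single pair of entity vectors as follows. The order in which the results are concatenated, the scalar form of the dot and bilinear products, and the weight matrix values are assumptions made for illustration; in the real combiner the bilinear weights would be learned.

```python
def compare_entities(entity_1, entity_2, bilinear_weights):
    """Concatenate dot product, element-wise product, absolute difference
    and bilinear product of two equally sized vectors."""
    dot_product = sum(a * b for a, b in zip(entity_1, entity_2))
    element_wise = [a * b for a, b in zip(entity_1, entity_2)]
    abs_difference = [abs(a - b) for a, b in zip(entity_1, entity_2)]
    # Bilinear product: entity_1 @ W @ entity_2, computed as (entity_1 @ W) . entity_2
    size = len(entity_2)
    projected = [
        sum(entity_1[i] * bilinear_weights[i][j] for i in range(len(entity_1)))
        for j in range(size)
    ]
    bilinear = sum(p * b for p, b in zip(projected, entity_2))
    return [dot_product] + element_wise + abs_difference + [bilinear]


# Entity representations after the FC layers, with output_size = 2 for brevity.
entity_1 = [1, 2]
entity_2 = [3, -1]
bilinear_weights = [[1, 0], [1, 1]]  # hypothetical fixed weights

comparison = compare_entities(entity_1, entity_2, bilinear_weights)
print(comparison, len(comparison))
```

Under these assumptions the concatenated width is 2 * output_size + 2, because the dot and bilinear products each contribute a single value.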

combiner:
    type: comparator
    entity_1:
    - feature_1
    - feature_2
    entity_2:
    - feature_3
    dropout: 0.0
    num_fc_layers: 1
    output_size: 256
    norm: null
    activation: relu
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null

Parameters:

  • entity_1 (default: null) : The list of input feature names [feature_1, feature_2, ...] constituting the first entity to compare. Required.
  • entity_2 (default: null) : The list of input feature names [feature_1, feature_2, ...] constituting the second entity to compare. Required.
  • dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • num_fc_layers (default: 1) : Number of stacked fully connected layers to apply. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
  • output_size (default: 256) : Output size of a fully connected layer.
  • norm (default: null) : Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
  • activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
  • use_bias (default: true): Whether the layer uses a bias vector.
  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.

  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.

FT-Transformer Combiner

The ft_transformer combiner implements the FT-Transformer architecture (Gorishniy et al., NeurIPS 2021). Each encoder output is projected to a token embedding, a learnable [CLS] token is prepended, and the full sequence is passed through a stack of Transformer self-attention layers. The output is the [CLS] token embedding, optionally followed by fully connected layers.

This is the recommended combiner for tabular data with 3+ input features, as it learns cross-feature interactions through attention while maintaining a fixed-size output via the [CLS] token.
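The role of the [CLS] token can be illustrated with a heavily simplified sketch. It assumes single-head self-attention with no learned projections (queries, keys and values are the tokens themselves), no feed-forward sublayer and no normalization, so it is not the FT-Transformer architecture itself; the function names and token values are hypothetical.

```python
import math


def self_attention(tokens):
    """Single-head self-attention with Q = K = V = tokens (a simplification)."""
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens
        ]
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        attended.append(
            [
                sum(weight / total * value[d] for weight, value in zip(weights, tokens))
                for d in range(dim)
            ]
        )
    return attended


def ft_transformer_sketch(feature_tokens, cls_token, num_layers=1):
    """Prepend the CLS token, run attention layers, return the CLS position."""
    tokens = [cls_token] + feature_tokens
    for _ in range(num_layers):
        tokens = self_attention(tokens)
    return tokens[0]


# Three feature tokens with hidden size 2, plus a CLS token set to zeros here.
feature_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cls_token = [0.0, 0.0]

output = ft_transformer_sketch(feature_tokens, cls_token)
print(len(output), output)
```

Regardless of how many input features there are, the output keeps the hidden size, because only the first position of the attended sequence is returned.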

combiner:
    type: ft_transformer
    dropout: 0.1
    num_fc_layers: 0
    output_size: 256
    norm: null
    fc_dropout: 0.0
    transformer_output_size: 256
    hidden_size: 256
    num_layers: 1
    num_heads: 8
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null
    fc_activation: relu
    fc_residual: false

Parameters:

  • dropout (default: 0.1) : Dropout rate for the transformer block.
  • num_fc_layers (default: 0) : The number of stacked fully connected layers (only applies if reduce_output is not null).
  • output_size (default: 256) : Output size of a fully connected layer.
  • norm (default: null) : Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
  • fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
  • transformer_output_size (default: 256): Size of the fully connected layer after self attention in the transformer block. This is usually the same as hidden_size and embedding_size.
  • hidden_size (default: 256): The number of hidden units of the TransformerStack as well as the dimension that each incoming input feature is projected to before feeding to the TransformerStack.
  • num_layers (default: 1): The number of transformer layers.
  • num_heads (default: 8): Number of heads of the self attention in the transformer block.
  • use_bias (default: true): Whether the layer uses a bias vector.
  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.

  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
  • fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
  • fc_residual (default: false): Whether to add a residual connection to each fully connected layer block. Requires all fully connected layers to have the same output_size.
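The fallback rule for fc_layers (a missing per-layer value is taken from the standalone parameter) can be sketched as a dictionary merge. This is an illustration only: `resolve_fc_layers` is a hypothetical helper, and the assumption that a null fc_layers expands to num_fc_layers default layers is stated here, not taken from the text above.

```python
def resolve_fc_layers(fc_layers, num_fc_layers, defaults):
    """Fill each layer dictionary with the standalone defaults it does not override."""
    if fc_layers is None:
        # Assumption: with no explicit list, build num_fc_layers default layers.
        fc_layers = [{} for _ in range(num_fc_layers)]
    return [{**defaults, **layer} for layer in fc_layers]


# Standalone parameters acting as per-layer defaults.
defaults = {"output_size": 256, "activation": "relu", "dropout": 0.0, "norm": None}

# Two layers: the first overrides output_size, the second overrides dropout.
layers = resolve_fc_layers(
    [{"output_size": 512}, {"dropout": 0.2}],
    num_fc_layers=0,
    defaults=defaults,
)
```

Here the first layer keeps the default activation and dropout while using an output size of 512, and the second keeps the default output size of 256 with a dropout of 0.2.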

Cross-Attention Combiner

The cross_attention combiner uses pairwise multi-head cross-attention between all input features. Each feature selectively queries relevant information from all other features. Research consistently shows 2-10% improvement over concatenation when combining heterogeneous modalities (e.g., text + tabular, image + tabular).
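As a rough illustration of "each feature queries all other features", the sketch below runs one simplified cross-attention layer in pure Python. It assumes a single head, identity projections, and a residual connection; the function names are hypothetical and the library's actual layer will differ.

```python
import math


def attend(query, context):
    """Scaled dot-product attention of one query vector over a list of context vectors."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in context]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * vec[i] for w, vec in zip(weights, context)) for i in range(dim)]


def cross_attention_layer(tokens):
    """Update every feature token with information attended from the other tokens."""
    updated = []
    for i, token in enumerate(tokens):
        others = tokens[:i] + tokens[i + 1:]  # a feature does not attend to itself
        context = attend(token, others)
        updated.append([t + c for t, c in zip(token, context)])  # residual connection
    return updated


# Three features, assumed already projected to a shared hidden size of 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention_layer(tokens)
```

With only two features, each token attends to the single other token with weight 1, so the layer reduces to adding the other feature's vector.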

combiner:
    type: cross_attention
    num_fc_layers: 0
    norm: null
    activation: relu
    fc_layers: null
    weights_initializer: xavier_uniform
    bias_initializer: zeros
    norm_params: null
    hidden_size: 256
    num_heads: 8
    num_layers: 1
    dropout: 0.1
    output_size: 256
    use_bias: true

Parameters:

  • num_fc_layers (default: 0) : Number of stacked fully connected layers to apply. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
  • norm (default: null) : Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
  • activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.

  • hidden_size (default: 256): Hidden size of the cross-attention layers. Each input feature is projected to this size.
  • num_heads (default: 8): Number of attention heads in the cross-attention layers.
  • num_layers (default: 1): Number of stacked cross-attention layers.
  • dropout (default: 0.1): Dropout rate for the cross-attention layers.
  • output_size (default: 256): Output size of the fully connected layer after cross-attention.
  • use_bias (default: true): Whether the layer uses a bias vector.

Perceiver Combiner

The perceiver combiner implements a Perceiver IO-style architecture (Jaegle et al., ICLR 2022). A set of learnable latent tokens cross-attend to all encoder outputs, then self-attend among themselves. This provides efficient cross-modal fusion with bounded memory, regardless of the number of input features.
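To make the memory argument concrete, the sketch below counts attention score entries for a latent bottleneck versus full self-attention over the inputs. It is back-of-the-envelope arithmetic under stated assumptions (one cross-attention step, one score per query/key pair, heads ignored), and `attention_score_counts` is a hypothetical helper.

```python
def attention_score_counts(num_inputs, num_latents=32, num_self_attention_layers=2):
    """Count query/key score entries for a latent bottleneck vs. full self-attention."""
    # Latents (queries) attend to every input token (keys): grows linearly in num_inputs.
    cross = num_latents * num_inputs
    # Latents attend among themselves: independent of num_inputs.
    latent_self = num_self_attention_layers * num_latents ** 2
    # Baseline: every input token attends to every other input token in each layer.
    full_self = num_self_attention_layers * num_inputs ** 2
    return {"perceiver": cross + latent_self, "full_self_attention": full_self}


counts = attention_score_counts(num_inputs=1000)
```

With 1000 input tokens and the default 32 latents, the latent bottleneck needs 34,048 score entries against 2,000,000 for full self-attention. For a handful of inputs the comparison reverses, since the latent self-attention cost is fixed.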

combiner:
    type: perceiver
    num_fc_layers: 0
    norm: null
    activation: relu
    fc_layers: null
    weights_initializer: xavier_uniform
    bias_initializer: zeros
    norm_params: null
    num_latents: 32
    latent_dim: 256
    num_heads: 8
    num_self_attention_layers: 2
    dropout: 0.1
    reduce_output: mean
    output_size: 256
    use_bias: true

Parameters:

  • num_fc_layers (default: 0) : Number of stacked fully connected layers to apply. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
  • norm (default: null) : Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
  • activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.

  • num_latents (default: 32): Number of learned latent query vectors.
  • latent_dim (default: 256): Dimensionality of each latent query vector.
  • num_heads (default: 8): Number of attention heads in the cross-attention and self-attention layers.
  • num_self_attention_layers (default: 2): Number of self-attention layers applied to the latent queries after cross-attention.
  • dropout (default: 0.1): Dropout rate for the attention layers.
  • reduce_output (default: mean): Strategy to use to aggregate the latent vectors before the FC stack. Options: last, sum, mean, avg, max, concat, attention, attention_pooling, none, None, null.
  • output_size (default: 256): Output size of the fully connected layer after the Perceiver block.
  • use_bias (default: true): Whether the layer uses a bias vector.
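The simpler reduce_output strategies can be sketched on a small list of latent vectors. This is an illustrative pure-Python version with a hypothetical `reduce_latents` helper; the attention-based options and the none variants are left out.

```python
def reduce_latents(latents, mode="mean"):
    """Aggregate a list of latent vectors into a single vector."""
    dim = len(latents[0])
    columns = [[vec[i] for vec in latents] for i in range(dim)]
    if mode in ("mean", "avg"):
        return [sum(col) / len(col) for col in columns]
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "last":
        return list(latents[-1])
    if mode == "concat":
        # Output length becomes num_latents * latent_dim.
        return [x for vec in latents for x in vec]
    raise ValueError(f"unsupported mode in this sketch: {mode}")


latents = [[1.0, 2.0], [3.0, 4.0]]
mean_vec = reduce_latents(latents, "mean")
concat_vec = reduce_latents(latents, "concat")
```

All modes except concat return a vector of size latent_dim; concat returns num_latents times latent_dim values, which changes the input size seen by any following fully connected layers.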

Gated Fusion Combiner

The gated_fusion combiner uses a gating mechanism inspired by Flamingo (Alayrac et al., NeurIPS 2022). Per-feature gates are initialized near zero, so the model starts with simple concatenation and gradually learns cross-modal residual connections as training progresses. This provides stable training when combining pretrained and randomly initialized components.
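The effect of a gate initialized at zero can be shown with a minimal sketch. It assumes the tanh gating used in Flamingo; whether this combiner uses exactly that form is an assumption, and `gated_fuse` is a hypothetical helper.

```python
import math


def gated_fuse(feature, context, gate):
    """Add gated cross-feature context to a feature: feature + tanh(gate) * context."""
    scale = math.tanh(gate)
    return [f + scale * c for f, c in zip(feature, context)]


feature = [0.5, -1.0]  # one feature's projected representation
context = [2.0, 4.0]   # information gathered from the other features

at_init = gated_fuse(feature, context, gate=0.0)  # gate closed: feature passes through
later = gated_fuse(feature, context, gate=1.0)    # gate partly open after training
```

At a gate value of zero the feature passes through unchanged, so early training behaves like a combiner without cross-feature interaction, and the contribution of `context` grows only as the gate is learned.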

combiner:
    type: gated_fusion
    num_fc_layers: 0
    norm: null
    activation: relu
    fc_layers: null
    weights_initializer: xavier_uniform
    bias_initializer: zeros
    norm_params: null
    hidden_size: 256
    dropout: 0.1
    output_size: 256
    use_bias: true

Parameters:

  • num_fc_layers (default: 0) : Number of stacked fully connected layers to apply. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
  • norm (default: null) : Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null. See Normalization for details.
  • activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
  • fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
  • weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.

  • norm_params (default: null): Default parameters passed to the norm module. See Normalization for details.

  • hidden_size (default: 256): Hidden size of the gating layers. Each input feature is projected to this size.
  • dropout (default: 0.1): Dropout rate for the gating layers.
  • output_size (default: 256): Output size of the fully connected layer after gated fusion.
  • use_bias (default: true): Whether the layer uses a bias vector.

Common Parameters

These parameters are used across multiple combiners (and some encoders / decoders) in similar ways.

Normalization

Normalization applied at the beginning of the fully connected stack. If a layer in fc_layers does not specify its own norm, this is the default used for that layer. One of:

  • null: no normalization
  • batch: batch normalization
  • layer: layer normalization
  • ghost: ghost batch normalization

Batch Normalization

Applies Batch Normalization as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. See PyTorch documentation on batch normalization for more details.
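How eps and momentum enter the computation can be sketched for a single feature column in pure Python. This follows the update rule documented for PyTorch batch normalization (running statistic = (1 - momentum) * running + momentum * batch statistic, with the unbiased variance used for the running estimate); the affine scale and shift are omitted and `batch_norm_1d` is a hypothetical helper.

```python
def batch_norm_1d(values, running_mean, running_var, eps=0.001, momentum=0.1):
    """Normalize one feature column over the batch and update running statistics."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased variance for normalization
    normalized = [(v - mean) / (var + eps) ** 0.5 for v in values]
    unbiased_var = var * n / (n - 1)  # unbiased estimate feeds the running variance
    new_mean = (1 - momentum) * running_mean + momentum * mean
    new_var = (1 - momentum) * running_var + momentum * unbiased_var
    return normalized, new_mean, new_var


normalized, new_mean, new_var = batch_norm_1d(
    [1.0, 2.0, 3.0, 4.0], running_mean=0.0, running_var=1.0
)
```

The running statistics are what the layer uses at evaluation time, so a small momentum makes them change slowly from batch to batch.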

norm: batch
norm_params:
  eps: 0.001
  momentum: 0.1
  affine: true
  track_running_stats: true

Parameters:

  • eps (default: 0.001): Epsilon to be added to the batch norm denominator.
  • momentum (default: 0.1): The value used for the running_mean and running_var computation. Can be set to null for a cumulative moving average (i.e. simple average).
  • affine (default: true): When set to true, this module has learnable affine parameters.
  • track_running_stats (default: true): When set to true, this module tracks the running mean and variance. When set to false, it does not track these statistics and initializes the statistics buffers running_mean and running_var as null. When these buffers are null, the module always uses batch statistics, in both training and eval modes.

Layer Normalization

Applies Layer Normalization over a mini-batch of inputs as described in the paper Layer Normalization. See PyTorch documentation on layer normalization for more details.
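In contrast to batch normalization, layer normalization computes its statistics per example, across that example's features. A minimal pure-Python sketch for one example follows; `layer_norm` is a hypothetical helper, and the default weight and bias mirror the ones/zeros initialization of the affine parameters.

```python
def layer_norm(row, eps=0.00001, weight=None, bias=None):
    """Normalize one example across its features, then apply an elementwise affine map."""
    n = len(row)
    mean = sum(row) / n
    var = sum((x - mean) ** 2 for x in row) / n
    weight = weight or [1.0] * n  # affine weights start at ones
    bias = bias or [0.0] * n      # affine biases start at zeros
    return [w * (x - mean) / (var + eps) ** 0.5 + b for x, w, b in zip(row, weight, bias)]


out = layer_norm([1.0, 2.0, 3.0])
```

Because no statistics are shared across examples, the result for one example does not depend on the batch size or on the other examples in the batch.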

norm: layer
norm_params:
  eps: 0.00001
  elementwise_affine: true

Parameters:

  • eps (default: 0.00001): A value added to the denominator for numerical stability.
  • elementwise_affine (default: true): When set to true, this module has learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases).

Ghost Batch Normalization

Ghost Batch Norm is a technique designed to address the "generalization gap" whereby the training process breaks down with very large batch sizes. If you are using a large batch size (typically in the thousands) to maximize GPU utilization, but the model is not converging well, enabling ghost batch norm can be a useful technique to improve convergence.

When using ghost batch norm, you specify a virtual_batch_size (default 128) representing the "ideal" batch size to train with (ignoring throughput or GPU utilization). The ghost batch norm will then subdivide each batch into subbatches of size virtual_batch_size and apply batch normalization to each.
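The subdivision step can be sketched for a single feature column in pure Python. This is an illustration under simplifying assumptions (no running statistics, no affine parameters), with hypothetical helper names.

```python
def normalize(chunk, epsilon):
    """Plain batch normalization of one chunk using its own mean and variance."""
    n = len(chunk)
    mean = sum(chunk) / n
    var = sum((v - mean) ** 2 for v in chunk) / n
    return [(v - mean) / (var + epsilon) ** 0.5 for v in chunk]


def ghost_batch_norm(batch, virtual_batch_size=128, epsilon=0.001):
    """Split the batch into virtual batches and normalize each one independently."""
    out = []
    for start in range(0, len(batch), virtual_batch_size):
        chunk = batch[start:start + virtual_batch_size]
        out.extend(normalize(chunk, epsilon))
    return out


batch = [1.0, 3.0, 101.0, 103.0]
ghost = ghost_batch_norm(batch, virtual_batch_size=2)  # two virtual batches of size 2
full = normalize(batch, 0.001)                         # ordinary batch norm for comparison
```

In this toy batch, each virtual batch is normalized with its own statistics, so the first elements of both sub-batches map to the same value, whereas normalizing the full batch at once keeps the two groups far apart.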

A notable downside to ghost batch norm is that it is more computationally expensive than traditional batch norm, so it is only recommended to use it when the batch size that maximizes throughput is significantly higher than the batch size that yields the best convergence (one or more orders of magnitude higher).

The approach was introduced in Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks and has since been popularized by its use in TabNet.

norm: ghost
norm_params:
  virtual_batch_size: 128
  epsilon: 0.001
  momentum: 0.05

Parameters:

  • virtual_batch_size (default: 128): Size of the virtual batch used by ghost batch norm. If null, regular batch norm is used instead. B_v from the TabNet paper.
  • epsilon (default: 0.001): Epsilon to be added to the batch norm denominator.
  • momentum (default: 0.05): Momentum of the batch norm. 1 - m_B from the TabNet paper.