Combiner
Combiners take the outputs of all input feature encoders and combine them before providing the combined representation to the output feature decoders.
You can specify which one to use in the combiner
section of the configuration, and if you don't specify a combiner,
the concat
combiner will be used.
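For reference, this is a minimal sketch of where the combiner section sits in a full configuration; the feature names and types are purely illustrative:
input_features:
  - name: title
    type: text
  - name: price
    type: number
combiner:
  type: concat
output_features:
  - name: label
    type: category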
Concat Combiner¶
The concat
combiner assumes all outputs from encoders are tensors of size b x h
where b
is the batch size and h
is the hidden dimension, which can be different for each input.
If any inputs have more than 2 dimensions, a sequence or set feature for example, set the flatten_inputs
parameter to true
.
It concatenates along the h
dimension, and then (optionally) passes the concatenated tensor through a stack of fully connected layers.
It returns the final b x h'
tensor where h'
is the size of the last fully connected layer or the sum of the sizes of
the h
of all inputs in the case there are no fully connected layers.
If only a single input feature and no fully connected layer is specified, the output of the input feature encoder is
passed through the combiner unchanged.
+-----------+
|Input      |
|Feature 1  +-+
+-----------+ |            +---------+
+-----------+ | +------+   |Fully    |
|...        +--->Concat+--->Connected+->
+-----------+ | +------+   |Layers   |
+-----------+ |            +---------+
|Input      +-+
|Feature N  |
+-----------+
These are the available parameters of a concat
combiner:
- fc_layers (default null): a list of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the combiner will be used instead (a sketch of this form follows the example configuration below).
- num_fc_layers (default 0): the number of stacked fully connected layers that the combiner input passes through. Their output is projected into the combiner's output space.
- output_size (default 256): if an output_size is not already specified in fc_layers, this is the default output_size that will be used for each layer. It indicates the size of the output of a fully connected layer.
- use_bias (default true): boolean, whether the layer uses a bias vector.
- weights_initializer (default 'glorot_uniform'): initializer for the weight matrix. Options are: constant, identity, zeros, ones, orthogonal, normal, uniform, truncated_normal, variance_scaling, glorot_normal, glorot_uniform, xavier_normal, xavier_uniform, he_normal, he_uniform, lecun_normal, lecun_uniform. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
- bias_initializer (default 'zeros'): initializer for the bias vector. The options and the dictionary form are the same as for weights_initializer. For a description of the parameters of each initializer, see torch.nn.init.
- norm (default null): if a norm is not already specified in fc_layers, this is the default norm that will be used for each layer. It indicates the normalization applied to the output and can be null, batch or layer.
- norm_params (default null): parameters used if norm is either batch or layer. For information on parameters used with batch see the Torch documentation on batch normalization, or for layer see the Torch documentation on layer normalization.
- activation (default relu): if an activation is not already specified in fc_layers, this is the default activation that will be used for each layer. It indicates the activation function applied to the output.
- dropout (default 0): dropout rate.
- flatten_inputs (default false): if true, flatten the tensors from all the input features into a vector.
- residual (default false): if true, adds a residual connection to each fully connected layer block. It is required that all fully connected layers have the same size for this parameter to work correctly.
Example configuration of a concat
combiner:
type: concat
fc_layers: null
num_fc_layers: 0
output_size: 256
use_bias: true
weights_initializer: 'glorot_uniform'
bias_initializer: 'zeros'
norm: null
norm_params: null
activation: relu
dropout: 0
flatten_inputs: false
residual: false
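As a sketch of how fc_layers can be specified explicitly as a list of dictionaries, with flatten_inputs enabled for inputs with more than 2 dimensions (the layer sizes and other values here are purely illustrative):
type: concat
flatten_inputs: true
fc_layers:
  - output_size: 128
    activation: relu
    dropout: 0.1
  - output_size: 64
    norm: batch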
Sequence Concat Combiner¶
The sequence_concat
combiner assumes at least one output from the encoders is a tensor of size b x s x h
where b
is
the batch size, s
is the length of the sequence and h
is the hidden dimension.
A sequence-like (sequence, text or time series) input feature can be specified with the main_sequence_feature
parameter, which takes the name of a sequence-like input feature as its value.
If no main_sequence_feature
is specified, the combiner will look through all the features in the order they are
defined in the configuration and will look for a feature with a rank 3 tensor output (sequence, text or time series).
If it cannot find one it will raise an exception, otherwise the output of that feature will be used for concatenating
the other features along the sequence s
dimension.
If there are other input features with a rank 3 output tensor, the combiner will concatenate them alongside the s
dimension, which means that all of them must have an identical s
dimension, otherwise a dimension mismatch error will be
thrown during training when a datapoint with two sequence features of different lengths is provided.
Other features that have a b x h
rank 2 tensor output will be replicated s
times and concatenated to the s
dimension.
The final output is a b x s x h'
tensor where h'
is the size of the concatenation of the h
dimensions of all input features.
Sequence
Feature
Output
+---------+
|emb seq 1|
+---------+
|...      +--+
+---------+  |   +-----------------+
|emb seq n|  |   |emb seq 1|emb oth|  +------+
+---------+  |   +-----------------+  |      |
             +--->...      |...    +-->Reduce+->
Other        |   +-----------------+  |      |
Feature      |   |emb seq n|emb oth|  +------+
Output       |   +-----------------+
             |
+-------+    |
|emb oth+----+
+-------+
These are the available parameters of a sequence_concat
combiner:
- main_sequence_feature (default null): name of a sequence, text, or time series feature to concatenate the outputs of the other features to. If no main_sequence_feature is specified, the combiner will look through all the features in the order they are defined in the configuration and will look for a feature with a rank 3 tensor output (sequence, text or time series). If it cannot find one it will raise an exception, otherwise the output of that feature will be used for concatenating the other features along the sequence s dimension. If there are other input features with a rank 3 output tensor, the combiner will concatenate them alongside the s dimension. All sequence-like input features must have an identical s dimension, otherwise an error will be thrown.
- reduce_output (default null): describes the strategy to use to aggregate the output along the sequence dimension. Possible values are null, sum, mean and sqrt (the weighted sum divided by the square root of the sum of the squares of the weights).
Example configuration of a sequence_concat
combiner:
type: sequence_concat
main_sequence_feature: null
reduce_output: null
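As a sketch that explicitly selects the sequence feature to concatenate onto and sums along the sequence dimension (the feature name is illustrative):
type: sequence_concat
main_sequence_feature: description
reduce_output: sum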
Sequence Combiner¶
The sequence
combiner stacks a sequence concat combiner with a sequence encoder.
All the considerations about input tensor ranks described for the sequence concat combiner
apply also in this case, but the main difference is that this combiner uses the b x s x h'
output of the sequence
concat combiner, where b
is the batch size, s
is the sequence length and h'
is the sum of the hidden dimensions of
all input features, as input for any of the sequence encoders described in the sequence features encoders section.
Refer to that section for more detailed information about the sequence encoders and their parameters.
All considerations on the shape of the outputs of the sequence encoders also apply to the sequence combiner.
Sequence
Feature
Output
+---------+
|emb seq 1|
+---------+
|...      +--+
+---------+  |   +-----------------+
|emb seq n|  |   |emb seq 1|emb oth|  +--------+
+---------+  |   +-----------------+  |Sequence|
             +--->...      |...    +-->Encoder +->
Other        |   +-----------------+  |        |
Feature      |   |emb seq n|emb oth|  +--------+
Output       |   +-----------------+
             |
+-------+    |
|emb oth+----+
+-------+
Example configuration of a sequence
combiner:
type: sequence
main_sequence_feature: null
encoder: parallel_cnn
... encoder parameters ...
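As a rough sketch using an rnn encoder instead (the feature name and the encoder parameters shown are illustrative assumptions; refer to the sequence features encoders section for each encoder's exact parameters):
type: sequence
main_sequence_feature: description
encoder: rnn
cell_type: lstm
state_size: 256
reduce_output: last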
TabNet Combiner¶
The tabnet
combiner implements the TabNet model, which uses attention and sparsity
to achieve high performance on tabular data. It assumes all outputs from encoders are tensors of size b x h
where b
is the batch size and h
is the hidden dimension, which can be different for each input.
If the input tensors have a different shape, it automatically flattens them.
It returns the final b x h'
tensor where h'
is the user-specified output size.
+-----------+
|Input      |
|Feature 1  +-+
+-----------+ |
+-----------+ | +------+
|...        +--->TabNet+-->
+-----------+ | +------+
+-----------+ |
|Input      +-+
|Feature N  |
+-----------+
These are the available parameters of a tabnet
combiner:
- size: the size of the hidden layers. N_a in the paper.
- output_size: the size of the output of each step and of the final aggregated representation. N_d in the paper.
- num_steps (default 1): number of steps / repetitions of the attentive transformer and feature transformer computations. N_steps in the paper.
- num_total_blocks (default 4): total number of feature transformer blocks at each step.
- num_shared_blocks (default 2): number of shared feature transformer blocks across the steps.
- relaxation_factor (default 1.5): factor that influences how many times a feature should be used across the steps of computation. A value of 1 implies each feature should be used once, while a higher value allows for multiple usages. gamma in the paper.
- bn_epsilon (default 0.001): epsilon to be added to the batch norm denominator.
- bn_momentum (default 0.7): momentum of the batch norm. m_B in the paper.
- bn_virtual_bs (default null): size of the virtual batch size used by ghost batch norm. If null, regular batch norm is used instead (see the sketch after the example configuration below). B_v from the paper.
- sparsity (default 0.00001): multiplier of the sparsity inducing loss. lambda_sparse in the paper.
- dropout (default 0): dropout rate.
Example configuration of a tabnet
combiner:
type: tabnet
size: 32
output_size: 32
num_steps: 5
num_total_blocks: 4
num_shared_blocks: 2
relaxation_factor: 1.5
bn_epsilon: 0.001
bn_momentum: 0.7
bn_virtual_bs: 128
sparsity: 0.00001
dropout: 0
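For contrast with the example above, which enables ghost batch norm with virtual batches of size 128, the following sketch keeps bn_virtual_bs at its default null so that regular batch norm is used (the size and output_size values are illustrative):
type: tabnet
size: 32
output_size: 32
num_steps: 5
bn_virtual_bs: null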
Transformer Combiner¶
The transformer
combiner combines input features using a stack of Transformer blocks (from Attention Is All You Need).
It assumes all outputs from encoders are tensors of size b x h
where b
is the batch size and h
is the hidden
dimension, which can be different for each input.
If the input tensors have a different shape, it automatically flattens them.
It then projects each input tensor to the same hidden / embedding size and encodes them with a stack of Transformer layers.
Finally, the transformer combiner applies a reduction to the outputs of the Transformer stack, followed by optional
fully connected layers.
The output is a b x h'
tensor where h'
is the size of the last fully connected layer or the hidden / embedding
size, or a b x n x h'
where n
is the number of input features and h'
is the hidden / embedding size if no reduction
is applied.
+-----------+
|Input      |
|Feature 1  +-+
+-----------+ |
+-----------+ |  +------------+   +------+   +----------+
|...        +--->|Transformer +-->|Reduce+-->|Fully     +->
|           | |  |Stack       |   +------+   |Connected |
+-----------+ |  +------------+              |Layers    |
+-----------+ |                              +----------+
|Input      +-+
|Feature N  |
+-----------+
These are the available parameters of a transformer
combiner:
- num_layers (default 1): number of layers in the stack of transformer blocks.
- hidden_size (default 256): hidden / embedding size of each transformer block.
- num_heads (default 8): number of attention heads of each transformer block.
- transformer_output_size (default 256): size of the fully connected layers inside each transformer block.
- dropout (default 0): dropout rate after the transformer.
- fc_layers (default null): a list of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the combiner will be used instead.
- num_fc_layers (default 0): the number of stacked fully connected layers to apply after reduction of the transformer output sequence.
- output_size (default 256): if an output_size is not already specified in fc_layers, this is the default output_size that will be used for each layer. It indicates the size of the output of a fully connected layer.
- use_bias (default true): boolean, whether the layer uses a bias vector.
- weights_initializer (default 'glorot_uniform'): initializer for the weight matrix. Options are: constant, identity, zeros, ones, orthogonal, normal, uniform, truncated_normal, variance_scaling, glorot_normal, glorot_uniform, xavier_normal, xavier_uniform, he_normal, he_uniform, lecun_normal, lecun_uniform. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0} (a sketch of the dictionary form follows the example configuration below). To know the parameters of each initializer, please refer to torch.nn.init.
- bias_initializer (default 'zeros'): initializer for the bias vector. The options and the dictionary form are the same as for weights_initializer. To know the parameters of each initializer, please refer to torch.nn.init.
- norm (default null): if a norm is not already specified in fc_layers, this is the default norm that will be used for each layer. It indicates the normalization applied to the output and can be null, batch or layer.
- norm_params (default null): parameters used if norm is either batch or layer. For information on parameters used with batch see the Torch documentation on batch normalization, or for layer see the Torch documentation on layer normalization.
- activation (default relu): if an activation is not already specified in fc_layers, this is the default activation that will be used for each layer. It indicates the activation function applied to the output.
- fc_dropout (default 0): dropout rate for the fully connected layers.
- fc_residual (default false): if true, adds a residual connection to each fully connected layer block. It is required that all fully connected layers have the same size for this parameter to work correctly.
- reduce_output (default mean): describes the strategy to use to aggregate the output of the transformer. Possible values include last, sum, mean, concat, or none.
Example configuration of a transformer
combiner:
type: transformer
num_layers: 1
hidden_size: 256
num_heads: 8
transformer_output_size: 256
dropout: 0.1
fc_layers: null
num_fc_layers: 0
output_size: 256
use_bias: True
weights_initializer: glorot_uniform
bias_initializer: zeros
norm: null
norm_params: null
fc_activation: relu
fc_dropout: 0
fc_residual: false
reduce_output: mean
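As a sketch of the dictionary form of weights_initializer described above (the mean and stddev values are illustrative):
type: transformer
num_layers: 1
hidden_size: 256
weights_initializer: {type: normal, mean: 0, stddev: 0.05}
bias_initializer: zeros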
Resources to learn more about transformers:
- CS480/680 Lecture 19: Attention and Transformer Networks (VIDEO)
- Attention is all you need - Attentional Neural Network Models Masterclass (VIDEO)
- Illustrated: Self-Attention (Colab notebook)
TabTransformer Combiner¶
The tabtransformer
combiner combines input features with the following sequence of operations: all features except binary and number features are projected to an embedding size; these embeddings are concatenated as if they were a sequence and passed through a transformer; after the transformer, the binary and number features (each of size 1) are concatenated together, then concatenated with the reduced output of the transformer, and the result is passed to a stack of fully connected layers (from TabTransformer: Tabular Data Modeling Using Contextual Embeddings).
It assumes all outputs from encoders are tensors of size b x h
where b
is the batch size and h
is the hidden
dimension, which can be different for each input.
If the input tensors have a different shape, it automatically flattens them.
It then projects each input tensor to the same hidden / embedding size and encodes them with a stack of Transformer layers.
Finally, the transformer combiner applies a reduction to the outputs of the Transformer stack, followed by the above concatenation and optional fully connected layers.
The output is a b x h'
tensor where h'
is the size of the last fully connected layer or the hidden / embedding
size, or a b x n x h'
where n
is the number of input features and h'
is the hidden / embedding size if no reduction
is applied.
+-----------+
|Input      |
|Feature 1  +-+
+-----------+ |
+-----------+ |  +------------+  +-----------+  +------+  +---------+  +----------+
|    ...    +--->|Categorical +->|Transformer+->|Reduce+->|Combined +->|Fully     +->
|           | |  |Embeddings  |  |Stack      |  +------+  |Hidden   |  |Connected |
+-----------+ |  +------------+  +-----------+            |Layers   |  |Layers    |
+-----------+ |                                           +---------+  +----------+
|Input      +-+                  +-----------+                 ^
|Feature N  | |                  | Binary &  |                 |
+-----------+ +----------------->| Numerical +-----------------+
                                 | Encodings |
                                 +-----------+
These are the available parameters of a tabtransformer combiner:
- num_layers (default 1): number of layers in the stack of transformer blocks.
- hidden_size (default 256): hidden / embedding size of each transformer block.
- num_heads (default 8): number of attention heads of each transformer block.
- transformer_output_size (default 256): size of the fully connected layers inside each transformer block.
- dropout (default 0): dropout rate after the transformer.
- fc_layers (default null): a list of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the combiner will be used instead.
- num_fc_layers (default 0): the number of stacked fully connected layers to apply after reduction of the transformer output sequence.
- output_size (default 256): if an output_size is not already specified in fc_layers, this is the default output_size that will be used for each layer. It indicates the size of the output of a fully connected layer.
- use_bias (default true): boolean, whether the layer uses a bias vector.
- weights_initializer (default 'glorot_uniform'): initializer for the weight matrix. Options are: constant, identity, zeros, ones, orthogonal, normal, uniform, truncated_normal, variance_scaling, glorot_normal, glorot_uniform, xavier_normal, xavier_uniform, he_normal, he_uniform, lecun_normal, lecun_uniform. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. To know the parameters of each initializer, please refer to torch.nn.init.
- bias_initializer (default 'zeros'): initializer for the bias vector. The options and the dictionary form are the same as for weights_initializer. To know the parameters of each initializer, please refer to torch.nn.init.
- norm (default null): if a norm is not already specified in fc_layers, this is the default norm that will be used for each layer. It indicates the normalization applied to the output and can be null, batch or layer.
- norm_params (default null): parameters used if norm is either batch or layer. For information on parameters used with batch see the Torch documentation on batch normalization, or for layer see the Torch documentation on layer normalization.
- activation (default relu): if an activation is not already specified in fc_layers, this is the default activation that will be used for each layer. It indicates the activation function applied to the output.
- fc_dropout (default 0): dropout rate for the fully connected layers.
- fc_residual (default false): if true, adds a residual connection to each fully connected layer block. It is required that all fully connected layers have the same size for this parameter to work correctly.
- reduce_output (default mean): describes the strategy to use to aggregate the output of the transformer. Possible values include last, sum, mean, concat, or none.
- embed_input_feature_name (default null): controls the size of the embeddings. Valid values are add, which uses the hidden_size value, or an integer to set a specific value. In the case of an integer value, it must be smaller than hidden_size (a sketch follows the example configuration below).
Example configuration of a tabtransformer
combiner:
type: tabtransformer
num_layers: 1
hidden_size: 256
num_heads: 8
transformer_output_size: 256
dropout: 0.1
fc_layers: null
num_fc_layers: 0
output_size: 256
use_bias: True
weights_initializer: glorot_uniform
bias_initializer: zeros
norm: null
norm_params: null
fc_activation: relu
fc_dropout: 0
fc_residual: false
reduce_output: mean
embed_input_feature_name: null
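As a sketch, embed_input_feature_name can be set to add to reuse the hidden_size value, or to a specific integer smaller than hidden_size (the integer here is illustrative):
type: tabtransformer
hidden_size: 256
embed_input_feature_name: 64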
Comparator Combiner¶
The comparator
combiner compares the hidden representation of two entities defined by lists of features.
It assumes all outputs from encoders are tensors of size b x h
where b
is the batch size and h
is the hidden
dimension, which can be different for each input.
If the input tensors have a different shape, it automatically flattens them.
It then concatenates the representations of the features within each entity and projects both entity representations to vectors of size output_size.
Finally, it compares the two entity representations by dot product, element-wise multiplication, absolute difference and bilinear product.
It returns the final b x h'
tensor where h'
is the size of the concatenation of the four comparisons.
+-----------+
|Entity 1   |
|Input      |
|Feature 1  +-+
+-----------+ |
+-----------+ |  +-------+   +----------+
|...        +--->|Concat +-->|FC Layers +--+
|           | |  +-------+   +----------+  |
+-----------+ |                             |
+-----------+ |                             |
|Entity 1   +-+                             |
|Input      |                               |
|Feature N  |                               |
+-----------+                              |   +---------+
                                           +-->| Compare +->
+-----------+                              |   +---------+
|Entity 2   |                               |
|Input      |                               |
|Feature 1  +-+                             |
+-----------+ |                             |
+-----------+ |  +-------+   +----------+  |
|...        +--->|Concat +-->|FC Layers +--+
|           | |  +-------+   +----------+
+-----------+ |
+-----------+ |
|Entity 2   +-+
|Input      |
|Feature N  |
+-----------+
These are the available parameters of a comparator
combiner:
- entity_1: list of input features that compose the first entity to compare.
- entity_2: list of input features that compose the second entity to compare.
- fc_layers (default null): a list of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the combiner will be used instead.
- num_fc_layers (default 0): the number of stacked fully connected layers that the combiner input passes through. Their output is projected into the combiner's output space.
- output_size (default 256): if output_size is not already specified in fc_layers, this is the default output_size that will be used for each layer. It indicates the size of the output of a fully connected layer.
- use_bias (default true): boolean, whether the layer uses a bias vector.
- weights_initializer (default 'glorot_uniform'): initializer for the weight matrix. Options are: constant, identity, zeros, ones, orthogonal, normal, uniform, truncated_normal, variance_scaling, glorot_normal, glorot_uniform, xavier_normal, xavier_uniform, he_normal, he_uniform, lecun_normal, lecun_uniform. Alternatively it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. To know the parameters of each initializer, please refer to torch.nn.init.
- bias_initializer (default 'zeros'): initializer for the bias vector. The options and the dictionary form are the same as for weights_initializer. To know the parameters of each initializer, please refer to torch.nn.init.
- norm (default null): if a norm is not already specified in fc_layers, this is the default norm that will be used for each layer. It indicates the normalization applied to the output and can be null, batch or layer.
- norm_params (default null): parameters used if norm is either batch or layer. For information on parameters used with batch see the Torch documentation on batch normalization, or for layer see the Torch documentation on layer normalization.
- activation (default relu): if an activation is not already specified in fc_layers, this is the default activation that will be used for each layer. It indicates the activation function applied to the output.
- dropout (default 0): dropout rate for the fully connected layers.
Example configuration of a comparator
combiner:
type: comparator
entity_1: [feature_1, feature_2]
entity_2: [feature_3, feature_4]
fc_layers: null
num_fc_layers: 0
output_size: 256
use_bias: true
weights_initializer: 'glorot_uniform'
bias_initializer: 'zeros'
norm: null
norm_params: null
activation: relu
dropout: 0