# ⇅ Set Features

## Preprocessing

Set features are expected to be provided as a string of elements separated by whitespace, e.g. "elem5 elem9 elem6".
The string values are transformed into a binary (`int8`) valued matrix of size `n x l` (where `n` is the number of rows in the dataset and `l` is the minimum of the size of the biggest set and a `max_size` parameter) and added to HDF5 with a key that reflects the name of the column in the dataset.
Sets are mapped to integers by first using a tokenizer to map each input string to a sequence of set elements (by default this is done by splitting on spaces).
Next, a dictionary is constructed which maps each unique element to its frequency in the dataset column. Elements are ranked by frequency and a sequential integer ID is assigned in ascending order from the most frequent to the rarest.
The column name is added to the JSON file, with an associated dictionary containing:

- the mapping from integer to string (`idx2str`)
- the mapping from string to id (`str2idx`)
- the mapping from string to frequency (`str2freq`)
- the maximum size of all sets (`max_set_size`)
- additional preprocessing information (by default, how to fill missing values and what token to use to fill them)
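The frequency-ranking step can be sketched in Python. This is a hypothetical helper, not Ludwig's implementation, and the `<UNK>` placeholder is omitted for brevity:

```python
from collections import Counter

def build_set_metadata(column, most_common=10000):
    """Sketch of the mapping step: split each cell on spaces, count element
    frequencies, and assign sequential integer IDs from the most frequent
    element to the rarest."""
    counts = Counter(tok for cell in column for tok in cell.split())
    ranked = [tok for tok, _ in counts.most_common(most_common)]
    return {
        "idx2str": ranked,                                    # integer -> string
        "str2idx": {tok: i for i, tok in enumerate(ranked)},  # string -> id
        "str2freq": {tok: counts[tok] for tok in ranked},     # string -> frequency
        "max_set_size": max(len(cell.split()) for cell in column),
    }

meta = build_set_metadata(["elem5 elem9 elem6", "elem5 elem6", "elem5"])
```

For this toy column, `elem5` appears three times and therefore receives ID 0, while the rarest element, `elem9`, receives the highest ID.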

```
preprocessing:
    tokenizer: space
    missing_value_strategy: fill_with_const
    lowercase: false
    most_common: 10000
    fill_value: <UNK>
```

Parameters:

- `tokenizer` (default: `space`): Defines how to transform the raw text content of the dataset column to a set of elements. The default value `space` splits the string on spaces. Common options include: `underscore` (splits on underscore), `comma` (splits on comma), `json` (decodes the string into a set or a list through a JSON parser).
- `missing_value_strategy` (default: `fill_with_const`): What strategy to follow when there's a missing value in a set column. Options: `fill_with_const`, `fill_with_mode`, `bfill`, `ffill`, `drop_row`. See Missing Value Strategy for details.
- `lowercase` (default: `false`): If true, converts the string to lowercase before tokenizing. Options: `true`, `false`.
- `most_common` (default: `10000`): The maximum number of most common tokens to be considered. If the data contains more than this amount, the most infrequent tokens will be treated as unknown.
- `fill_value` (default: `<UNK>`): The value to replace missing values with when `missing_value_strategy` is `fill_with_const`.

Preprocessing parameters can also be defined once and applied to all set input features using the Type-Global Preprocessing section.
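For illustration, the binary encoding of a set column can be sketched as follows. The `str2idx` mapping is assumed, and the matrix width here is simply the vocabulary size rather than the padded width used on disk:

```python
import numpy as np

# Assumed element-to-ID mapping (as produced by the metadata step)
str2idx = {"elem5": 0, "elem6": 1, "elem9": 2}
column = ["elem5 elem9 elem6", "elem5 elem6", "elem9"]

# One binary row per dataset row: 1 where the element is present in the set
matrix = np.zeros((len(column), len(str2idx)), dtype=np.int8)
for i, cell in enumerate(column):
    for tok in cell.split():
        matrix[i, str2idx[tok]] = 1
```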

## Input Features

```
graph LR
A["0\n0\n1\n0\n1\n1\n0"] --> B["2\n4\n5"];
B --> C["emb 2\nemb 4\nemb 5"];
C --> D["Aggregation\n Reduce\n Operation"];
```

Set features have one encoder: `embed`. The raw binary values coming from the input placeholders are first transformed to sparse integer lists, then mapped to either dense or sparse embeddings (one-hot encodings), and finally reduced on the sequence dimension and returned as an aggregated embedding vector. Inputs are of size `b` while outputs are of size `b x h`, where `b` is the batch size and `h` is the dimensionality of the embeddings.
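The encoder's data flow can be sketched with NumPy. This is an illustrative sketch with made-up sizes and a `sum` reduction, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, h = 7, 4
emb = rng.normal(size=(vocab_size, h))     # dense embedding matrix

row = np.array([0, 0, 1, 0, 1, 1, 0], dtype=np.int8)  # binary set encoding
idx = np.flatnonzero(row)                  # sparse integer list: elements 2, 4, 5
pooled = emb[idx].sum(axis=0)              # reduce on the sequence dimension -> size h
```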

The encoder parameters specified at the feature level are:

- `tied` (default: `null`): name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.

```
name: set_column_name
type: set
tied: null
encoder:
    type: embed
```

Encoder type and encoder parameters can also be defined once and applied to all set input features using the Type-Global Encoder section.

### Encoders

#### Embed Encoder

```
encoder:
    type: embed
    dropout: 0.0
    embedding_size: 50
    output_size: 10
    activation: relu
    norm: null
    representation: dense
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    norm_params: null
    num_fc_layers: 0
    fc_layers: null
    pretrained_embeddings: null
```

Parameters:

- `dropout` (default: `0.0`): Dropout probability for the embedding.
- `embedding_size` (default: `50`): The maximum embedding size. The actual size will be `min(vocabulary_size, embedding_size)` for `dense` representations and exactly `vocabulary_size` for the `sparse` encoding, where `vocabulary_size` is the number of different strings appearing in the training set in the input column (plus 1 for the unknown token placeholder).
- `output_size` (default: `10`): If `output_size` is not already specified in `fc_layers`, this is the default `output_size` that will be used for each layer. It indicates the size of the output of a fully connected layer.
- `activation` (default: `relu`): The default activation function that will be used for each layer. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `norm` (default: `null`): The default norm that will be used for each layer. Options: `batch`, `layer`, `null`. See Normalization for details.
- `representation` (default: `dense`): The representation of the embedding. Either `dense` or `sparse`. Options: `dense`, `sparse`.
- `use_bias` (default: `true`): Whether the layer uses a bias vector. Options: `true`, `false`.
- `bias_initializer` (default: `zeros`): Initializer to use for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`.
- `weights_initializer` (default: `xavier_uniform`): Initializer to use for the weights matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`.
- `embeddings_on_cpu` (default: `false`): By default embedding matrices are stored on GPU memory if a GPU is used, as this allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: `true`, `false`.
- `embeddings_trainable` (default: `true`): If `true`, embeddings are trained during the training process; if `false`, embeddings are fixed. This may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when `representation` is `dense`, as `sparse` one-hot encodings are not trainable. Options: `true`, `false`.
- `norm_params` (default: `null`): Parameters used if `norm` is either `batch` or `layer`.
- `num_fc_layers` (default: `0`): The number of stacked fully connected layers that the input to the feature passes through. Their output is projected into the feature's output space.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters for each fully connected layer.
- `pretrained_embeddings` (default: `null`): By default, `dense` embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if `representation` is `dense`.

## Output Features

```
graph LR
A["Combiner\n Output"] --> B["Fully\n Connected\n Layers"];
B --> C["Projection into\n Output Space"];
C --> D["Sigmoid"];
subgraph DEC["DECODER.."]
B
C
D
end
```

Set features can be used when multi-label classification needs to be performed. There is only one decoder available for set features: a (potentially empty) stack of fully connected layers, followed by a projection into a vector of size of the number of available classes, followed by a sigmoid.
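The final sigmoid step can be illustrated as follows: each class receives an independent probability, and the predicted set is formed by the classes whose probability exceeds a threshold. This is a sketch, and the 0.5 threshold is an assumption for illustration:

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])          # projection into output space, one per class
probs = 1.0 / (1.0 + np.exp(-logits))        # element-wise sigmoid
predicted_set = np.flatnonzero(probs > 0.5)  # classes above the threshold form the set
```

Unlike a softmax over categories, the per-class sigmoids are independent, so any number of classes (including none) can be predicted at once, which is what makes this decoder suitable for multi-label classification.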

```
name: set_column_name
type: set
reduce_input: sum
dependencies: []
reduce_dependencies: sum
loss:
    type: sigmoid_cross_entropy
decoder:
    type: classifier
```

Parameters:

- `reduce_input` (default: `sum`): defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: `sum`, `mean` or `avg`, `max`, `concat` (concatenates along the first dimension).
- `dependencies` (default: `[]`): the output features this one is dependent on. For a detailed explanation refer to Output Feature Dependencies.
- `reduce_dependencies` (default: `sum`): defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: `sum`, `mean` or `avg`, `max`, `concat` (concatenates along the first dimension), `last` (returns the last vector of the first dimension).
- `loss` (default: `{type: sigmoid_cross_entropy}`): a dictionary containing a loss `type`. The only supported loss `type` for set features is `sigmoid_cross_entropy`. See Loss for details.
- `decoder` (default: `{"type": "classifier"}`): Decoder for the desired task. Options: `classifier`. See Decoder for details.

### Decoders

#### Classifier

```
decoder:
    type: classifier
    num_fc_layers: 0
    fc_output_size: 256
    fc_norm: null
    fc_dropout: 0.0
    fc_activation: relu
    fc_layers: null
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm_params: null
    use_bias: true
    weights_initializer: xavier_uniform
    bias_initializer: zeros
```

Parameters:

- `num_fc_layers` (default: `0`): Number of fully-connected layers if `fc_layers` is not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- `fc_output_size` (default: `256`): Output size of the fully connected stack.
- `fc_norm` (default: `null`): Default normalization applied at the beginning of fully connected layers. Options: `batch`, `layer`, `ghost`, `null`.
- `fc_dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
- `fc_activation` (default: `relu`): Default activation function applied to the output of the fully connected layers. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- `fc_use_bias` (default: `true`): Whether the layer uses a bias vector in the fc_stack. Options: `true`, `false`.
- `fc_weights_initializer` (default: `xavier_uniform`): The weights initializer to use for the layers in the fc_stack.
- `fc_bias_initializer` (default: `zeros`): The bias initializer to use for the layers in the fc_stack.
- `fc_norm_params` (default: `null`): Default parameters passed to the `norm` module.
- `use_bias` (default: `true`): Whether the layer uses a bias vector. Options: `true`, `false`.
- `weights_initializer` (default: `xavier_uniform`): Initializer for the weight matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`.
- `bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`.

Decoder type and decoder parameters can also be defined once and applied to all set output features using the Type-Global Decoder section.

### Loss

#### Sigmoid Cross Entropy

```
loss:
    type: sigmoid_cross_entropy
    class_weights: null
    weight: 1.0
```

Parameters:

- `class_weights` (default: `null`): Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied by the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of an unbalanced class distribution. The ordering of the vector follows the category-to-integer ID mapping in the JSON metadata file (the `<UNK>` class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like `{class_a: 0.5, class_b: 0.7, ...}`.
- `weight` (default: `1.0`): Weight of the loss.
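As an illustration, a class-weighted sigmoid cross entropy can be sketched as follows. This is a simplified sketch, not Ludwig's exact implementation:

```python
import numpy as np

def sigmoid_cross_entropy(logits, targets, class_weights=None):
    """Binary cross entropy applied independently to each class,
    optionally scaled by per-class weights."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    ce = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    if class_weights is not None:
        ce = ce * np.asarray(class_weights)
    return ce.mean()
```

With zero logits every class probability is 0.5, so the unweighted loss equals `log(2)` regardless of the targets.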

Loss type and loss related parameters can also be defined once and applied to all set output features using the Type-Global Loss section.

### Metrics

The metrics that are calculated every epoch and are available for set features are `jaccard` (the number of elements in the intersection of prediction and label divided by the number of elements in the union) and the `loss` itself.
You can set either of them as `validation_metric` in the `training` section of the configuration if you set the `validation_field` to be the name of a set feature.
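The `jaccard` computation described above can be sketched as:

```python
def jaccard(prediction, label):
    """Size of the intersection divided by the size of the union."""
    prediction, label = set(prediction), set(label)
    return len(prediction & label) / len(prediction | label)
```

For example, `jaccard({"a", "b"}, {"b", "c"})` returns 1/3: only `b` is shared among the three distinct elements.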