# ⇅ Category Features

## Preprocessing

Category features are transformed into integer valued vectors of size `n` (where `n` is the size of the dataset) and added to the HDF5 with a key that reflects the name of the column in the dataset.
Categories are mapped to integers by first collecting a dictionary of all unique category strings present in the column of the dataset, ranking them by descending frequency, and assigning a sequential integer ID from the most frequent to the most rare (with 0 assigned to the special unknown placeholder token `<UNK>`).
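In plain Python terms, this mapping can be sketched roughly as follows (`build_category_vocab` is a hypothetical helper for illustration, not part of the library):

```
from collections import Counter

def build_category_vocab(column, unknown_token="<UNK>"):
    """Map category strings to integer IDs, most frequent first.

    ID 0 is reserved for the unknown placeholder token.
    """
    str2freq = Counter(column)
    # Rank unique categories by descending frequency.
    ranked = sorted(str2freq, key=lambda s: -str2freq[s])
    idx2str = [unknown_token] + ranked
    str2idx = {s: i for i, s in enumerate(idx2str)}
    return idx2str, str2idx, dict(str2freq)

idx2str, str2idx, str2freq = build_category_vocab(
    ["cat", "dog", "cat", "bird", "cat", "dog"])
# str2idx == {"<UNK>": 0, "cat": 1, "dog": 2, "bird": 3}
```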
The column name is added to the JSON file, with an associated dictionary containing:

- the mapping from integer to string (`idx2str`)
- the mapping from string to id (`str2idx`)
- the mapping from string to frequency (`str2freq`)
- the size of the set of all tokens (`vocab_size`)
- additional preprocessing information (by default, how to fill missing values and what token to use to fill missing values)

```
preprocessing:
    missing_value_strategy: fill_with_const
    fill_value: <UNK>
    lowercase: false
    most_common: 10000
    cache_encoder_embeddings: false
```

Parameters:

- `missing_value_strategy` (default: `fill_with_const`): What strategy to follow when there's a missing value in a category column. Options: `fill_with_const`, `fill_with_mode`, `bfill`, `ffill`, `drop_row`. See Missing Value Strategy for details.
- `fill_value` (default: `<UNK>`): The value to replace missing values with when `missing_value_strategy` is `fill_with_const`.
- `lowercase` (default: `false`): Whether the string has to be lowercased before being handled by the tokenizer. Options: `true`, `false`.
- `most_common` (default: `10000`): The maximum number of most common tokens to be considered. If the data contains more than this number, the most infrequent tokens will be treated as unknown.
- `cache_encoder_embeddings` (default: `false`): For fixed encoders, compute encoder embeddings in preprocessing to avoid this step at train time. Can speed up the time taken per step during training, but will invalidate the preprocessed data if the encoder type is changed. Some model types (GBM) require caching encoder embeddings to use embedding features, and those models will override this value to `true` automatically. Options: `true`, `false`.

Preprocessing parameters can also be defined once and applied to all category input features using the Type-Global Preprocessing section.

## Input Features

Category features have three encoders.
The `passthrough` encoder passes the raw integer values coming from the input placeholders to outputs of size `b x 1`.
The other two encoders map to either `dense` or `sparse` embeddings (one-hot encodings) and return outputs of size `b x h`, where `b` is the batch size and `h` is the dimensionality of the embeddings.

The encoder parameters specified at the feature level are:

- `tied` (default: `null`): name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.

Example category feature entry in the input features list:

```
name: category_column_name
type: category
tied: null
encoder: 
    type: dense
```

The available encoder parameters are:

- `type` (default: `dense`): the possible values are `passthrough`, `dense` and `sparse`. `passthrough` outputs the raw integer values unaltered, `dense` randomly initializes a trainable embedding matrix, and `sparse` uses one-hot encoding.
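The difference between the three encoders can be sketched in numpy terms (an illustrative approximation of the shapes involved, not the actual implementation):

```
import numpy as np

rng = np.random.default_rng(0)
vocab_size, h = 4, 8
ids = np.array([1, 3, 0])            # integer category IDs, shape (b,)

# passthrough: raw IDs unaltered, shape (b, 1)
passthrough = ids.reshape(-1, 1)

# dense: lookup rows of a trainable embedding matrix, shape (b, h)
embedding_matrix = rng.normal(size=(vocab_size, h))
dense = embedding_matrix[ids]

# sparse: one-hot encoding, shape (b, vocab_size)
sparse = np.eye(vocab_size)[ids]
```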

Encoder type and encoder parameters can also be defined once and applied to all category input features using the Type-Global Encoder section.

### Encoders

#### Dense Encoder

```
encoder:
    type: dense
    dropout: 0.0
    embedding_size: 50
    embedding_initializer: null
    embeddings_on_cpu: false
    embeddings_trainable: true
    pretrained_embeddings: null
```

Parameters:

- `dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `embedding_size` (default: `50`): The maximum embedding size. The actual size will be `min(vocabulary_size, embedding_size)` for `dense` representations and exactly `vocabulary_size` for the `sparse` encoding, where `vocabulary_size` is the number of different strings appearing in the training set in the column the feature is named after (plus 1 for the `<UNK>` token).
- `embedding_initializer` (default: `null`): Initializer for the embedding matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`, `null`.
- `embeddings_on_cpu` (default: `false`): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve embedding lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: `true`, `false`.
- `embeddings_trainable` (default: `true`): If `true`, embeddings are trained during the training process; if `false`, embeddings are fixed. This may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when `representation` is `dense`; `sparse` one-hot encodings are not trainable. Options: `true`, `false`.
- `pretrained_embeddings` (default: `null`): Path to a file containing pretrained embeddings. By default `dense` embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all the other embeddings plus some random noise to make them different from each other. This parameter has effect only if `representation` is `dense`.
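This loading logic can be sketched roughly as follows, assuming a whitespace-separated GloVe-format file (`load_pretrained_embeddings` is a hypothetical helper, not the library's actual function):

```
import numpy as np

def load_pretrained_embeddings(path, str2idx, rng=np.random.default_rng(0)):
    """Sketch: keep only in-vocabulary rows from a GloVe-format file,
    initialize unmatched vocabulary entries with the mean embedding
    plus small random noise."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip().split(" ")
            if token in str2idx:          # discard out-of-vocabulary rows
                vectors[token] = np.array(values, dtype=np.float32)
    dim = len(next(iter(vectors.values())))
    mean = np.mean(list(vectors.values()), axis=0)
    matrix = np.empty((len(str2idx), dim), dtype=np.float32)
    for token, idx in str2idx.items():
        if token in vectors:
            matrix[idx] = vectors[token]
        else:                             # e.g. <UNK> or unmatched labels
            matrix[idx] = mean + rng.normal(scale=0.01, size=dim)
    return matrix
```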

#### Sparse Encoder

```
encoder:
    type: sparse
    dropout: 0.0
    embedding_initializer: null
    embeddings_on_cpu: false
    embeddings_trainable: false
    pretrained_embeddings: null
```

Parameters:

- `dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `embedding_initializer` (default: `null`): Initializer for the embedding matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`, `null`.
- `embeddings_on_cpu` (default: `false`): Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve embedding lookups. By default embedding matrices are stored on GPU memory if a GPU is used, as this allows for faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: `true`, `false`.
- `embeddings_trainable` (default: `false`): If `true`, embeddings are trained during the training process; if `false`, embeddings are fixed. This may be useful when loading pretrained embeddings to avoid finetuning them. This parameter has effect only when `representation` is `dense`; `sparse` one-hot encodings are not trainable. Options: `true`, `false`.
- `pretrained_embeddings` (default: `null`): Path to a file containing pretrained embeddings. By default `dense` embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file containing the embeddings is loaded, only the embeddings with labels present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all the other embeddings plus some random noise to make them different from each other. This parameter has effect only if `representation` is `dense`.

## Output Features

```
graph LR
A["Combiner\n Output"] --> B["Fully\n Connected\n Layers"];
B --> C["Projection into\n Output Space"];
C --> D["Softmax"];
subgraph DEC["DECODER.."]
B
C
D
end
```

Category features can be used when multi-class classification needs to be performed. There is only one decoder available for category features: a (potentially empty) stack of fully connected layers, followed by a projection into a vector whose size is the number of available classes, followed by a softmax.
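The decoder's forward pass can be sketched as follows (an illustrative numpy approximation, not the actual implementation):

```
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def classifier_decoder(combiner_output, fc_weights, proj_weights):
    """Sketch: (possibly empty) FC stack -> projection -> softmax."""
    h = combiner_output
    for W, b in fc_weights:               # may be an empty list
        h = np.maximum(h @ W + b, 0.0)    # relu activation
    W, b = proj_weights
    logits = h @ W + b                    # project into output space
    return softmax(logits)

rng = np.random.default_rng(0)
batch, hidden, num_classes = 2, 16, 5
probs = classifier_decoder(
    rng.normal(size=(batch, hidden)),
    fc_weights=[],                        # empty FC stack (num_fc_layers: 0)
    proj_weights=(rng.normal(size=(hidden, num_classes)),
                  np.zeros(num_classes)),
)
```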

Example category output feature using default parameters:

```
name: category_column_name
type: category
reduce_input: sum
dependencies: []
calibration: false
reduce_dependencies: sum
loss:
    type: softmax_cross_entropy
    confidence_penalty: 0
    robust_lambda: 0
    class_weights: null
    class_similarities: null
    class_similarities_temperature: 0
decoder:
    type: classifier
```

Parameters:

- `reduce_input` (default: `sum`): defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: `sum`, `mean` or `avg`, `max`, `concat` (concatenates along the first dimension), `last` (returns the last vector of the first dimension).
- `calibration` (default: `false`): if true, performs calibration by temperature scaling after training is complete. Calibration uses the validation set to find a scale factor (temperature) which is multiplied with the logits to shift output probabilities closer to true likelihoods.
- `dependencies` (default: `[]`): the output features this one is dependent on. For a detailed explanation refer to Output Features Dependencies.
- `reduce_dependencies` (default: `sum`): defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: `sum`, `mean` or `avg`, `max`, `concat` (concatenates along the first dimension), `last` (returns the last vector of the first dimension).
- `loss` (default: `{type: softmax_cross_entropy}`): a dictionary containing a loss `type`. `softmax_cross_entropy` is the only supported loss type for category output features. See Loss for details.
- `top_k` (default: `3`): determines the parameter `k`, the number of categories to consider when computing the `top_k` measure. It computes accuracy, counting a match if the true category appears in the first `k` predicted categories ranked by the decoder's confidence.
- `decoder` (default: `{"type": "classifier"}`): Decoder for the desired task. Options: `classifier`. See Decoder for details.
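As an illustration of the `calibration` option, temperature scaling can be sketched as a search for the scale factor that minimizes negative log-likelihood on the validation set (a grid search here for simplicity; real implementations typically use an optimizer):

```
import numpy as np

def nll(logits, labels):
    """Mean negative log-likelihood of the true labels."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def calibrate_temperature(val_logits, val_labels,
                          grid=np.linspace(0.1, 5.0, 50)):
    """Pick the scale factor (temperature) that, multiplied with the
    logits, minimizes validation NLL."""
    return min(grid, key=lambda t: nll(val_logits * t, val_labels))
```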

Decoder type and decoder parameters can also be defined once and applied to all category output features using the Type-Global Decoder section.

### Decoders

#### Classifier

```
decoder:
    type: classifier
    num_fc_layers: 0
    fc_output_size: 256
    fc_norm: null
    fc_dropout: 0.0
    fc_activation: relu
    fc_layers: null
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm_params: null
    use_bias: true
    weights_initializer: xavier_uniform
    bias_initializer: zeros
```

Parameters:

- `num_fc_layers` (default: `0`): Number of fully-connected layers if `fc_layers` is not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- `fc_output_size` (default: `256`): Output size of the fully connected stack.
- `fc_norm` (default: `null`): Default normalization applied at the beginning of fully connected layers. Options: `batch`, `layer`, `ghost`, `null`. See Normalization for details.
- `fc_dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `fc_activation` (default: `relu`): Default activation function applied to the output of the fully connected layers. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- `fc_use_bias` (default: `true`): Whether the layers in the fc_stack use a bias vector. Options: `true`, `false`.
- `fc_weights_initializer` (default: `xavier_uniform`): The weights initializer to use for the layers in the fc_stack.
- `fc_bias_initializer` (default: `zeros`): The bias initializer to use for the layers in the fc_stack.
- `fc_norm_params` (default: `null`): Default parameters passed to the `norm` module.
- `use_bias` (default: `true`): Whether the layer uses a bias vector. Options: `true`, `false`.
- `weights_initializer` (default: `xavier_uniform`): Initializer for the weight matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`.
- `bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`.

### Loss

#### Softmax Cross Entropy

```
loss:
    type: softmax_cross_entropy
    class_weights: null
    weight: 1.0
    robust_lambda: 0
    confidence_penalty: 0
    class_similarities: null
    class_similarities_temperature: 0
```

Parameters:

- `class_weights` (default: `null`): Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied by the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the `<UNK>` class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like `{class_a: 0.5, class_b: 0.7, ...}`.
- `weight` (default: `1.0`): Weight of the loss.
- `robust_lambda` (default: `0`): Replaces the loss with `(1 - robust_lambda) * loss + robust_lambda / c` where `c` is the number of classes. Useful in case of noisy labels.
- `confidence_penalty` (default: `0`): Penalizes overconfident predictions (low entropy) by adding a term `a * (max_entropy - entropy) / max_entropy` to the loss, where `a` is the value of this parameter. Useful in case of noisy labels.
- `class_similarities` (default: `null`): If not `null`, it is a `c x c` matrix in the form of a list of lists that contains the mutual similarity of classes. It is used if `class_similarities_temperature` is greater than 0. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the `<UNK>` class needs to be included too).
- `class_similarities_temperature` (default: `0`): The temperature parameter of the softmax that is performed on each row of `class_similarities`. The output of that softmax is used to determine the supervision vector to provide instead of the one-hot vector that would otherwise be provided for each datapoint. The intuition behind it is that errors between similar classes are more tolerable than errors between really different classes.
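The loss options above can be sketched together in numpy (an illustrative approximation, not the library code; `class_similarities` handling is omitted):

```
import numpy as np

def softmax_cross_entropy(logits, labels, class_weights=None,
                          robust_lambda=0.0, confidence_penalty=0.0):
    """Sketch of softmax cross entropy with the optional modifiers."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_example = -log_probs[np.arange(len(labels)), labels]
    if class_weights is not None:     # weight each datapoint by its class
        per_example = per_example * np.asarray(class_weights)[labels]
    loss = per_example.mean()
    c = logits.shape[1]
    if robust_lambda > 0:             # (1 - lambda) * loss + lambda / c
        loss = (1 - robust_lambda) * loss + robust_lambda / c
    if confidence_penalty > 0:        # a * (max_entropy - entropy) / max_entropy
        probs = np.exp(log_probs)
        entropy = -(probs * log_probs).sum(axis=1).mean()
        max_entropy = np.log(c)
        loss = loss + confidence_penalty * (max_entropy - entropy) / max_entropy
    return loss
```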

Loss and loss related parameters can also be defined once and applied to all category output features using the Type-Global Loss section.

### Metrics

The metrics that are calculated every epoch and are available for category features are `accuracy`, `hits_at_k` (computes accuracy, counting a match if the true category appears in the first `k` predicted categories ranked by the decoder's confidence) and the `loss` itself.
You can set any of them as `validation_metric` in the `training` section of the configuration if you set the `validation_field` to be the name of a category feature.
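The `hits_at_k` metric can be sketched as (an illustrative snippet, not the library implementation):

```
import numpy as np

def hits_at_k(probs, labels, k=3):
    """Fraction of rows whose true label is among the top-k predictions."""
    topk = np.argsort(-probs, axis=1)[:, :k]
    return np.mean([label in row for label, row in zip(labels, topk)])

probs = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.4, 0.1]])
hits_at_k(probs, np.array([2, 1]), k=2)   # both true labels in top 2 -> 1.0
```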