# Vector Features

Vector features provide an ordered set of numerical values within a single feature.

This is useful for providing pre-trained representations or activations obtained from other models, or for providing multivariate inputs and outputs. An interesting use of vector features is the possibility of providing a probability distribution as output for a multiclass classification problem instead of a single correct class, as with a category feature. Vector output features can also be useful for distillation and noise-aware losses.

## Preprocessing

The data is expected as whitespace-separated numerical values, e.g. `"1.0 0.0 1.04 10.49"`. All vectors are expected to be of the same size.
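As a minimal sketch of this parsing rule (illustrative Python, not Ludwig's actual preprocessing code), each cell is split on whitespace, converted to floats, and checked against the expected vector size:

```python
# Illustrative sketch of vector-column parsing: split on whitespace,
# convert to floats, and enforce a consistent vector size.
# (Not Ludwig's actual preprocessing implementation.)
def parse_vector(raw, vector_size=None):
    values = [float(x) for x in raw.split()]
    if vector_size is not None and len(values) != vector_size:
        raise ValueError(f"expected {vector_size} values, got {len(values)}")
    return values

print(parse_vector("1.0 0.0 1.04 10.49"))  # [1.0, 0.0, 1.04, 10.49]
```

When `vector_size` is left as `null`, the size is inferred from the first rows of data, which is why every row must have the same number of values.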

```
preprocessing:
    vector_size: null
    missing_value_strategy: fill_with_const
    fill_value: ''
```

Parameters:

- `vector_size` (default: `null`): The size of the vector. If `null`, the vector size will be inferred from the data.
- `missing_value_strategy` (default: `fill_with_const`): What strategy to follow when there's a missing value in a vector column. Options: `fill_with_const`, `fill_with_mode`, `bfill`, `ffill`, `drop_row`. See Missing Value Strategy for details.
- `fill_value` (default: `''`): The value to replace missing values with when `missing_value_strategy` is `fill_with_const`.

Preprocessing parameters can also be defined once and applied to all vector input features using the Type-Global Preprocessing section.

## Input Features

The vector feature supports two encoders: `dense` and `passthrough`.

The encoder parameters specified at the feature level are:

- `tied` (default: `null`): name of the input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.

Example vector feature entry in the input features list:

```
name: vector_column_name
type: vector
tied: null
encoder:
    type: dense
```

The available encoder parameters are:

- `type` (default: `dense`): the possible values are `passthrough` and `dense`. `passthrough` outputs the raw vector values unaltered, while `dense` uses a stack of fully connected layers to create an embedding matrix.

Encoder type and encoder parameters can also be defined once and applied to all vector input features using the Type-Global Encoder section.

### Encoders

#### Passthrough Encoder

```
encoder:
    type: passthrough
```

There are no additional parameters for the `passthrough` encoder.

#### Dense Encoder

For vector features, a dense encoder (stack of fully connected layers) can be used to encode the vector.

```
encoder:
    type: dense
    dropout: 0.0
    output_size: 256
    norm: null
    num_layers: 1
    activation: relu
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null
```

Parameters:

- `dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `output_size` (default: `256`): Size of the output of the feature.
- `norm` (default: `null`): Default normalization applied at the beginning of fully connected layers. Options: `batch`, `layer`, `ghost`, `null`. See Normalization for details.
- `num_layers` (default: `1`): Number of stacked fully connected layers to apply. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- `activation` (default: `relu`): Default activation function applied to the output of the fully connected layers. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `use_bias` (default: `true`): Whether the layer uses a bias vector. Options: `true`, `false`.
- `bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `weights_initializer` (default: `xavier_uniform`): Initializer for the weight matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`. Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each initializer, see torch.nn.init.
- `norm_params` (default: `null`): Default parameters passed to the `norm` module.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
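To make the dense encoder concrete, here is a pure-Python sketch of what it computes: each stacked layer applies `activation(W @ x + b)`. The weights below are hypothetical placeholders; in practice they are learned and initialized according to `weights_initializer`.

```python
# Conceptual sketch of a dense encoder: a stack of fully connected
# layers, each computing relu(W @ x + b). Not Ludwig's implementation.
def relu(v):
    return [max(0.0, x) for x in v]

def fc_layer(x, weights, bias):
    # weights: one row per output unit, each row has len(x) entries
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def dense_encoder(x, layers):
    # layers: list of (weights, bias) pairs, one per stacked layer
    for weights, bias in layers:
        x = relu(fc_layer(x, weights, bias))
    return x

# One hypothetical layer projecting a 3-vector to a 2-vector:
layers = [([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.0, 0.0])]
print(dense_encoder([1.0, 2.0, 3.0], layers))  # [0.0, 3.0]
```

With `num_layers` greater than 1, the output of each layer is fed to the next, which is what adds capacity at the cost of more parameters.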

## Output Features

```
graph LR
A["Combiner Output"] --> B["Fully\n Connected\n Layers"];
B --> C["Projection into\nVector Size"] --> D["Softmax"];
subgraph DEC["DECODER.."]
B
C
D
end
```

Vector features can be used when multi-class classification needs to be performed with a noise-aware loss or when the task is multivariate regression.

There is only one decoder available for vector features: a (potentially empty) stack of fully connected layers, followed by a projection into a tensor of the vector size (optionally followed by a softmax in the case of multi-class classification).

Example vector output feature using default parameters:

```
name: vector_column_name
type: vector
reduce_input: sum
dependencies: []
reduce_dependencies: sum
loss:
    type: mean_squared_error
decoder:
    type: projector
```

Parameters:

- `reduce_input` (default: `sum`): defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: `sum`, `mean` or `avg`, `max`, `concat` (concatenates along the first dimension), `last` (returns the last vector of the first dimension).
- `dependencies` (default: `[]`): the output features this one is dependent on. For a detailed explanation refer to Output Features Dependencies.
- `reduce_dependencies` (default: `sum`): defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: `sum`, `mean` or `avg`, `max`, `concat` (concatenates along the first dimension), `last` (returns the last vector of the first dimension).
- `softmax` (default: `false`): determines whether to apply a softmax at the end of the decoder. It is useful for predicting a vector of values that sum up to 1 and can be interpreted as probabilities.
- `loss` (default: `{type: mean_squared_error}`): a dictionary containing a loss `type`. The available loss types are `mean_squared_error`, `mean_absolute_error` and `softmax_cross_entropy` (use it only if `softmax` is `true`). See Loss for details.
- `decoder` (default: `{"type": "projector"}`): Decoder for the desired task. Options: `projector`. See Decoder for details.
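As an illustration of what the `softmax` option does, a numerically stable softmax turns arbitrary decoder outputs into non-negative values summing to 1 (a sketch, not the library's code):

```python
import math

# Sketch of the optional softmax at the end of the decoder
# (enabled with `softmax: true`): converts raw outputs into
# values that sum to 1 and can be read as probabilities.
def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))  # the probabilities sum to 1
```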

### Decoders

#### Projector¶

```
decoder:
    type: projector
    num_fc_layers: 0
    fc_output_size: 256
    fc_norm: null
    fc_dropout: 0.0
    output_size: null
    fc_activation: relu
    activation: null
    fc_layers: null
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm_params: null
    use_bias: true
    weights_initializer: xavier_uniform
    bias_initializer: zeros
    clip: null
    multiplier: 1.0
```

Parameters:

- `num_fc_layers` (default: `0`): Number of fully connected layers if `fc_layers` is not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- `fc_output_size` (default: `256`): Output size of the fully connected stack.
- `fc_norm` (default: `null`): Default normalization applied at the beginning of fully connected layers. Options: `batch`, `layer`, `ghost`, `null`. See Normalization for details.
- `fc_dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `output_size` (default: `null`): Size of the output of the decoder.
- `fc_activation` (default: `relu`): Default activation function applied to the output of the fully connected layers. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `activation` (default: `null`): Activation function applied to the output. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- `fc_use_bias` (default: `true`): Whether the layer uses a bias vector in the fc_stack. Options: `true`, `false`.
- `fc_weights_initializer` (default: `xavier_uniform`): The weights initializer to use for the layers in the fc_stack.
- `fc_bias_initializer` (default: `zeros`): The bias initializer to use for the layers in the fc_stack.
- `fc_norm_params` (default: `null`): Default parameters passed to the `norm` module.
- `use_bias` (default: `true`): Whether the layer uses a bias vector. Options: `true`, `false`.
- `weights_initializer` (default: `xavier_uniform`): Initializer for the weight matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`.
- `bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`.
- `clip` (default: `null`): Clip the output of the decoder to be within the given range.
- `multiplier` (default: `1.0`): Multiplier to scale the activated outputs by. Useful when setting `activation` to something that outputs a value between [-1, 1] like `tanh`, to re-scale values back to the order of magnitude of the data you're trying to predict. A good rule of thumb in such cases is to pick a value like `x * (max - min)` where `x` is a scalar in the range [1, 2]. For example, if you're trying to predict something like temperature, it might make sense to pick a multiplier on the order of `100`.
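A rough sketch of the final decoder steps these parameters control, assuming `tanh` as the activation (function and parameter names here mirror the config keys but are otherwise illustrative):

```python
import math

# Sketch of the `activation`, `multiplier`, and `clip` steps applied
# to the projector's raw outputs (illustrative, not Ludwig's code).
def postprocess(raw, multiplier=1.0, clip=None):
    out = [math.tanh(x) * multiplier for x in raw]  # activation: tanh
    if clip is not None:
        lo, hi = clip
        out = [min(max(v, lo), hi) for v in out]
    return out

# tanh squashes outputs to [-1, 1]; a multiplier of 100 rescales
# them to roughly the magnitude of e.g. temperature data.
print(postprocess([3.0, -3.0], multiplier=100.0))
```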

Decoder type and decoder parameters can also be defined once and applied to all vector output features using the Type-Global Decoder section.

### Loss

#### Mean Squared Error (MSE)

```
loss:
    type: mean_squared_error
    weight: 1.0
```

Parameters:

- `weight` (default: `1.0`): Weight of the loss.

#### Mean Absolute Error (MAE)

```
loss:
    type: mean_absolute_error
    weight: 1.0
```

Parameters:

- `weight` (default: `1.0`): Weight of the loss.

#### Mean Absolute Percentage Error (MAPE)

```
loss:
    type: mean_absolute_percentage_error
    weight: 1.0
```

Parameters:

- `weight` (default: `1.0`): Weight of the loss.

#### Softmax Cross Entropy

```
loss:
    type: softmax_cross_entropy
    class_weights: null
    weight: 1.0
    robust_lambda: 0
    confidence_penalty: 0
    class_similarities: null
    class_similarities_temperature: 0
```

Parameters:

- `class_weights` (default: `null`): Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied to the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the `<UNK>` class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like `{class_a: 0.5, class_b: 0.7, ...}`.
- `weight` (default: `1.0`): Weight of the loss.
- `robust_lambda` (default: `0`): Replaces the loss with `(1 - robust_lambda) * loss + robust_lambda / c` where `c` is the number of classes. Useful in case of noisy labels.
- `confidence_penalty` (default: `0`): Penalizes overconfident (low entropy) predictions by adding an `a * (max_entropy - entropy) / max_entropy` term to the loss, where `a` is the value of this parameter. Useful in case of noisy labels.
- `class_similarities` (default: `null`): If not `null`, a `c x c` matrix in the form of a list of lists that contains the mutual similarity of classes. It is used if `class_similarities_temperature` is greater than 0. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the `<UNK>` class needs to be included too).
- `class_similarities_temperature` (default: `0`): The temperature parameter of the softmax that is performed on each row of `class_similarities`. The output of that softmax is used to determine the supervision vector to provide instead of the one-hot vector that would otherwise be provided for each datapoint. The intuition is that errors between similar classes are more tolerable than errors between very different classes.
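A sketch of how `robust_lambda` and `confidence_penalty` modify the loss, directly translating the formulas above (illustrative only, not the library's implementation):

```python
import math

# Noise-aware adjustments to a softmax cross entropy loss:
#   robust_lambda interpolates the loss toward robust_lambda / c,
#   confidence_penalty adds a * (max_entropy - entropy) / max_entropy.
def adjusted_loss(base_loss, probs, robust_lambda=0.0, confidence_penalty=0.0):
    c = len(probs)
    loss = (1 - robust_lambda) * base_loss + robust_lambda / c
    if confidence_penalty > 0:
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        max_entropy = math.log(c)
        loss += confidence_penalty * (max_entropy - entropy) / max_entropy
    return loss

# An overconfident prediction pays a penalty; a uniform one does not.
print(adjusted_loss(0.5, [0.98, 0.01, 0.01], confidence_penalty=0.1))
print(adjusted_loss(0.5, [1 / 3, 1 / 3, 1 / 3], confidence_penalty=0.1))
```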

#### Huber Loss

```
loss:
    type: huber
    weight: 1.0
    delta: 1.0
```

Parameters:

- `weight` (default: `1.0`): Weight of the loss.
- `delta` (default: `1.0`): Threshold at which to change between delta-scaled L1 and L2 loss.
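The switch that `delta` controls can be written out directly (a per-element sketch; the actual loss is computed over the whole vector):

```python
# Huber loss for one prediction/target pair: quadratic (L2) for
# errors within `delta`, delta-scaled linear (L1) beyond it.
def huber(pred, target, delta=1.0):
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

print(huber(1.2, 1.0))  # small error -> quadratic, ~0.02
print(huber(4.0, 1.0))  # large error -> linear, 2.5
```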

Loss type and loss related parameters can also be defined once and applied to all vector output features using the Type-Global Loss section.

### Metrics

The metrics that are calculated every epoch and are available for vector features are `mean_squared_error`, `mean_absolute_error`, `r2`, and the `loss` itself.

You can set any of them as `validation_metric` in the `training` section of the configuration if you set the `validation_field` to be the name of a vector feature.