Output Features
The `output_features` section of the Ludwig configuration has the same structure as `input_features`: it is a list of feature definitions, each containing a `name` and a `type`, with optional `preprocessing`, for the features the model should predict.
```yaml
output_features:
  - name: french
    type: text
```
```json
{
  "output_features": [
    {
      "name": "french",
      "type": "text"
    }
  ]
}
```
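As a quick sanity check, here is a minimal sketch of training with such a configuration through Ludwig's Python API; the `english` input feature and the `translation.csv` dataset are hypothetical placeholders, since a complete configuration also needs at least one input feature:

```python
from ludwig.api import LudwigModel

# Hypothetical end-to-end configuration: the "english" text input feature
# and the dataset file are invented for illustration.
config = {
    "input_features": [{"name": "english", "type": "text"}],
    "output_features": [{"name": "french", "type": "text"}],
}

model = LudwigModel(config)
results = model.train(dataset="translation.csv")  # hypothetical dataset file
```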
Decoders
Recall Ludwig's butterfly architecture: where input features have `encoders`, output features have `decoders`. All parameters besides `name` and `type` are passed to the function that builds the output feature's decoder, and each decoder can accept different parameters. Extensive documentation of the decoders available for each data type can be found in that data type's documentation.
```yaml
output_features:
  - name: french
    type: text
    decoder: generator
    cell_type: lstm
    num_layers: 2
    max_sequence_length: 256
```
```json
{
  "output_features": [
    {
      "name": "french",
      "type": "text",
      "decoder": "generator",
      "cell_type": "lstm",
      "num_layers": 2,
      "max_sequence_length": 256
    }
  ]
}
```
Decoders take the output of the combiner as input, process it further (for instance, by passing it through fully connected layers), and predict values, which are then used to compute the loss and evaluation metrics.
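Conceptually, the flow looks like this for a categorical output (a sketch in plain PyTorch, not Ludwig's actual implementation; all sizes are illustrative):

```python
import torch
import torch.nn as nn

# Sketch of the decoder flow: combiner output -> fully connected layers
# -> projection into the output space -> loss computation.
combiner_output = torch.randn(32, 128)            # batch of combiner vectors
fc_layers = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
projection = nn.Linear(64, 10)                    # e.g., 10 categories

logits = projection(fc_layers(combiner_output))   # predicted values
targets = torch.randint(0, 10, (32,))             # dummy labels
loss = nn.functional.cross_entropy(logits, targets)
```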
Decoders have additional parameters, in particular `loss`, which lets you specify a different loss to optimize for the specific decoder. For instance, number features support both `mean_squared_error` and `mean_absolute_error` as losses.
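For example (a minimal sketch, assuming a hypothetical number output feature named `price`), the loss type can be set in the feature's `loss` section:

```yaml
output_features:
  - name: price
    type: number
    loss:
      type: mean_absolute_error
```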
Details about the available decoders and losses, along with descriptions of all their parameters, are provided in the datatype-specific documentation.
Multi-task Learning
In most machine learning tasks you predict only a single target variable, but Ludwig natively supports multi-task learning: when multiple output features are specified, they are optimized jointly, and the loss that is optimized is a weighted sum of the losses of the individual output features.
By default each loss weight is `1`, but this can be changed by specifying a value for the `weight` parameter in the `loss` section of each output feature definition.
For example, given a `category` feature `A` and a `number` feature `B`, in order to optimize the loss
`loss_total = 1.5 * loss_A + 0.8 * loss_B`
the `output_features` section of the configuration should look like:
```yaml
output_features:
  - name: A
    type: category
    loss:
      weight: 1.5
  - name: B
    type: number
    loss:
      weight: 0.8
```
```json
{
  "output_features": [
    {
      "name": "A",
      "type": "category",
      "loss": {
        "weight": 1.5
      }
    },
    {
      "name": "B",
      "type": "number",
      "loss": {
        "weight": 0.8
      }
    }
  ]
}
```
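To make the weighting concrete, here is the arithmetic with hypothetical per-feature loss values:

```python
# Hypothetical per-feature loss values, purely for illustration.
loss_A = 0.4   # loss of category feature A
loss_B = 1.25  # loss of number feature B

# Combined multi-task loss, using the weights from the configuration above.
loss_total = 1.5 * loss_A + 0.8 * loss_B
print(loss_total)  # 1.6
```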
Output Feature Dependencies
An additional feature Ludwig provides is the concept of dependencies between `output_features`.
Sometimes output features have strong causal relationships, and knowing the prediction made for one can improve the prediction for another. For example, given two output features, a coarse-grained category and a fine-grained category, knowing the prediction for the coarse-grained category can narrow down the possible choices for the fine-grained one.
Output feature dependencies are declared in the feature definition. For example:
```yaml
output_features:
  - name: coarse_class
    type: category
    num_fc_layers: 2
    output_size: 64
  - name: fine_class
    type: category
    dependencies:
      - coarse_class
    num_fc_layers: 1
    output_size: 64
```
```json
{
  "output_features": [
    {
      "name": "coarse_class",
      "type": "category",
      "num_fc_layers": 2,
      "output_size": 64
    },
    {
      "name": "fine_class",
      "type": "category",
      "dependencies": [
        "coarse_class"
      ],
      "num_fc_layers": 1,
      "output_size": 64
    }
  ]
}
```
At model building time, Ludwig checks that no cyclic dependency exists.
For the downstream feature, Ludwig concatenates the final pre-prediction representations of all the output features it depends on with the combiner output, and feeds the result as input to the downstream feature's decoder.¹
¹ Assuming the input coming from the combiner has hidden size 128, two fully connected layers return a vector of hidden size 64 at the end of the `coarse_class` decoder (that vector is used for the final layer before projecting into the output `coarse_class` space). In the decoder of `fine_class`, the 64-dimensional `coarse_class` vector is concatenated with the combiner output vector, making a vector of hidden size 192 that is passed through a fully connected layer; the resulting 64-dimensional output is used for the final layer before projecting into the output class space of `fine_class`.
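A minimal sketch of the dependency wiring described in the footnote, using plain PyTorch rather than Ludwig internals (batch size and dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative dimensions from the footnote: combiner output of size 128,
# coarse_class representation of size 64, concatenated to size 192.
combiner_output = torch.randn(32, 128)  # batch of 32 combiner vectors
coarse_repr = torch.randn(32, 64)       # final coarse_class representation

# fine_class's fully connected layer consumes the concatenated vector.
fine_fc = nn.Sequential(nn.Linear(128 + 64, 64), nn.ReLU())
fine_repr = fine_fc(torch.cat([combiner_output, coarse_repr], dim=-1))
print(fine_repr.shape)  # torch.Size([32, 64])
```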