Output Features

The output_features list has the same structure as the input_features list: it is a list of dictionaries, each containing a name and a type. They represent the outputs / targets that you want your model to predict. In most machine learning tasks you predict only one target variable, but in Ludwig you are allowed to specify as many outputs as you want, and they are optimized in a multi-task fashion, using a weighted sum of their losses as the combined loss to optimize.
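
For instance, a minimal output_features list with a single category target could look like the following sketch (the feature name label is a placeholder for a column in your dataset):

output_features:
    -
        name: label
        type: category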

Instead of encoders, output features have decoders, but most datatypes have only one decoder available, so you usually don't have to specify it.

Decoders take the output of the combiner as input, process it further (for instance by passing it through fully connected layers), and finally predict values and compute a loss and some measures (different losses and measures apply depending on the datatype).

Decoders have additional parameters, in particular loss, which allows you to specify a different loss to optimize for that specific decoder; for instance, numerical features support both mean_squared_error and mean_absolute_error as losses. Details about the available decoders and losses, along with a description of all parameters, are provided in the datatype-specific sections.
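
For example, to optimize mean absolute error instead of the default mean squared error for a numerical output feature, the loss type can be set in the feature definition (the feature name price is a placeholder):

output_features:
    -
        name: price
        type: numerical
        loss:
            type: mean_absolute_error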

For the sake of simplicity you can imagine the input coming from the combiner to be a vector in most cases, but when it is not, the reduce_input parameter can be specified to change the default reduction behavior.
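
As a sketch, if the combiner outputs a sequence of vectors rather than a single vector, a reduction such as sum (the default) or mean can be selected per output feature (the feature name label is again a placeholder):

output_features:
    -
        name: label
        type: category
        reduce_input: mean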

Multi-task Learning

Since Ludwig allows multiple output features to be specified, and each output feature can be seen as a task the model is learning to perform, Ludwig supports multi-task learning natively. When multiple output features are specified, the loss that is optimized is a weighted sum of the losses of each individual output feature. By default each loss weight is 1, but it can be changed by specifying a value for the weight parameter in the loss section of each output feature definition.

For example, given a category feature A and a numerical feature B, in order to optimize the loss loss_total = 1.5 * loss_A + 0.8 * loss_B, the output_features section of the configuration should look like:

output_features:
    -
        name: A
        type: category
        loss:
          weight: 1.5
    -
        name: B
        type: numerical
        loss:
          weight: 0.8

Output Features Dependencies

An additional feature that Ludwig provides is the concept of dependencies between output_features. When you write the dictionary of a specific output feature, you can specify a list of other output features it depends on. At model building time Ludwig checks that no cyclic dependency exists. If you specify dependencies, Ludwig will concatenate the final representations of those output features (the ones used right before prediction) to the original input of the decoder. The reason is that if different output features have a causal dependency, knowing which prediction has been made for one can help in making the prediction for the other.

For instance, if the two output features are a coarse-grained category and a fine-grained category organized in a hierarchy, knowing the prediction made for the coarse-grained feature restricts the possible categories to predict for the fine-grained one. In this case the following configuration structure can be used:

output_features:
    -
        name: coarse_class
        type: category
        num_fc_layers: 2
        fc_size: 64
    -
        name: fine_class
        type: category
        dependencies:
            - coarse_class
        num_fc_layers: 1
        fc_size: 64

Assuming the input coming from the combiner has hidden dimension h = 128, the two fully connected layers at the end of the coarse_class decoder return a vector of hidden size 64 (that vector is used by the final layer before projecting into the output coarse_class space). In the decoder of fine_class, the 64-dimensional vector of coarse_class is concatenated to the combiner output vector, resulting in a vector of hidden size 128 + 64 = 192 that is passed through a fully connected layer, and the 64-dimensional output is used by the final layer before projecting into the output class space of fine_class.
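
The same configuration annotated with the resulting tensor sizes at each step (a sketch assuming the combiner output has hidden dimension 128):

output_features:
    -
        name: coarse_class
        type: category
        num_fc_layers: 2    # 128 -> 64 -> 64
        fc_size: 64         # final layer projects 64 -> number of coarse classes
    -
        name: fine_class
        type: category
        dependencies:
            - coarse_class  # decoder input becomes 128 + 64 = 192
        num_fc_layers: 1    # 192 -> 64
        fc_size: 64         # final layer projects 64 -> number of fine classes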