Input Features (↑)

The input_features section is list of feature definitions. Each feature definition contains two required fields: name and type.

YAMLPython Dict

input_features:
    -
        name: Pclass
        type: category

{
    "input_features": [{"name": "Pclass", "type": "category"}]
}

name is the name of the feature in the dataset. type is one of the supported data types.

Preprocessing¶

Recall Ludwig's butterfly framework.

Each input feature can specify its own preprocessing via the preprocessing sub-section.

YAMLPython Dict

input_features:
    -
        name: Fare
        type: number
        preprocessing:
            missing_value_strategy: fill_with_mean

{
    "input_features": [
        {
            "name": "Fare",
            "type": "number",
            "preprocessing": {
                "missing_value_strategy": "fill_with_mean"
            }
        }
    ]
}

It's also possible to specify preprocessing rules for all features of a certain type. See type-global preprocessing.

Encoders¶

Each input feature can configure a specific encoder to map input feature values into tensors. For instance, a user might want to encode a sequence feature using a transformer or an image feature using a stacked_cnn. Different data types support different encoders. Check the documentation for specific feature types to see what encoders are supported for that type.

All the other parameters besides name, type, and preprocessing, will be passed as parameters to the function that build the input feature's encoder, and each encoder can have different parameters. Extensive documentation for all of the encoders that can be used for a certain data type can also be found in each data type's documentation.

YAMLPython Dict

input_features:
    -
        name: text
        type: text
        encoder: bert
        preprocessing:
            tokenizer: space
        reduce_output: null
        trainable: true

{
    "input_features": [
        {
            "name": "text",
            "type": "text",
            "encoder": "bert",
            "level": "word",
            "preprocessing": {
                "word_tokenizer": "space"
            },
            "reduce_output": None,
            "trainable": True
        }
    ]
}

Encoders map raw feature values into tensors. These are usually vectors in the case of data types without a temporal / sequential aspect, matrices for when there is a temporal / sequential aspect, or higher rank tensors for when there is a spatial or a spatiotemporal aspect to the input data.

Different configurations of the same encoder may return a tensor with different rank, for instance a sequential encoder may return a vector of size h that is either the final vector of a sequence or the result of pooling over the sequence length, or it can return a matrix of size l x h where l is the length of the sequence and h is the hidden dimension if you specify the pooling reduce operation (reduce_output) to be None. For the sake of simplicity you can imagine the output to be a vector in most of the cases, but there is a reduce_output parameter one can specify to change the default behavior.

For first-time users, we recommend starting with the defaults.

Tying encoder weights¶

An additional feature that Ludwig provides is the option to have tied weights between different encoders. For instance, if my model takes two sentences as input and return the probability of their entailment, I may want to encode both sentences with the same encoder.

This is done by specifying the tied parameter of one feature to be the name of another output feature. For example:

YAMLPython Dict

input_features:
    -
        name: sentence1
        type: text
    -
        name: sentence2
        type: text
        tied: sentence1

{
    "input_features": [
        {
            "name": "sentence1",
            "type": "text"
        },
        {
            "name": "sentence2",
            "type": "text",
            "tied": "sentence1"
        }
    ]
}

Specifying a name of a non-existent input feature will result in an error. Also, in order to be able to have tied weights, all encoder parameters have to be identical between the two input features.