Add a Feature Type
Adding a New Feature Type to Ludwig¶
This guide walks through every file you need to touch when adding a brand-new feature type (e.g. a hypothetical "widget" type). Use ludwig/features/binary_feature.py and ludwig/schema/features/binary_feature.py as living reference implementations — they are among the simplest complete examples.
Conceptual overview¶
Each feature type lives in two parallel places:
| Layer | Location | Purpose |
|---|---|---|
| Schema | ludwig/schema/features/<type>_feature.py |
Pydantic-backed config classes; declares hyperparameters and their defaults |
| Feature module | ludwig/features/<type>_feature.py |
PyTorch modules; implements preprocessing, encoding, decoding, postprocessing |
The schema classes are used for config validation and serialization. The feature module classes are instantiated at model-build time using those configs. Neither layer knows the other exists at import time — they are wired together through the feature registry.
Step 1 — Define the constant¶
Add the type string to ludwig/constants.py:
WIDGET = "widget"
Step 2 — Write the schema file¶
Create ludwig/schema/features/widget_feature.py. The minimal required structure is:
from ludwig.constants import WIDGET, MODEL_ECD
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig, BaseOutputFeatureConfig
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import (
ecd_defaults_config_registry,
ecd_input_config_registry,
ecd_output_config_registry,
input_mixin_registry,
output_mixin_registry,
)
from ludwig.schema.utils import LudwigBaseConfig
@input_mixin_registry.register(WIDGET)
class WidgetInputFeatureConfigMixin(LudwigBaseConfig):
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=WIDGET)
class WidgetInputFeatureConfig(WidgetInputFeatureConfigMixin, BaseInputFeatureConfig):
type: str = schema_utils.ProtectedString(WIDGET)
encoder: BaseEncoderConfig = None
@ecd_input_config_registry.register(WIDGET)
class ECDWidgetInputFeatureConfig(WidgetInputFeatureConfig):
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=WIDGET,
default="dense", # default encoder for this type
)
# For output features only:
@output_mixin_registry.register(WIDGET)
class WidgetOutputFeatureConfigMixin(LudwigBaseConfig):
# add loss, calibration, etc. fields here
pass
class WidgetOutputFeatureConfig(WidgetOutputFeatureConfigMixin, BaseOutputFeatureConfig):
type: str = schema_utils.ProtectedString(WIDGET)
default_validation_metric: str = "some_metric"
@ecd_output_config_registry.register(WIDGET)
class ECDWidgetOutputFeatureConfig(WidgetOutputFeatureConfig):
pass
Key rules:
typemust be aProtectedStringwith your constant — this prevents accidental overwrite via user YAML.@input_mixin_registry.register/@output_mixin_registry.registermake the preprocessing config available toglobal_defaultsin Ludwig configs.@ecd_input_config_registry.register/@ecd_output_config_registry.registerwire the schema into the ECD model config builder.
Step 3 — Write the preprocessing config¶
Create ludwig/schema/features/preprocessing/widget_feature_preprocessing.py if your feature needs non-default preprocessing parameters, or register your type against an existing one (e.g. number_feature for scalars). For a new type, create the file:
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.constants import WIDGET
@register_preprocessor(WIDGET)
class WidgetPreprocessingConfig(BasePreprocessingConfig):
# add preprocessing hyperparameters here
pass
Step 4 — Write the feature module¶
Create ludwig/features/widget_feature.py. The required classes are:
Inner preprocessing module¶
import torch
from ludwig.features.base_feature import BasePreprocessingModule, FeaturePreprocessingMixin, InputFeature, OutputFeature
class _WidgetPreprocessing(BasePreprocessingModule):
"""Runs inside the model graph during inference to preprocess raw input."""
def __init__(self, metadata: dict, preprocessing_config, is_input_feature: bool = True):
super().__init__()
# store everything needed to preprocess at inference time
def forward(self, v):
# v is the raw column value; return a tensor
raise NotImplementedError
FeatureMixin (shared preprocessing logic)¶
FeaturePreprocessingMixin provides the Python-side preprocessing used during dataset preparation (not inside the model graph). You must implement add_feature_data and get_preprocessing_module:
class WidgetFeatureMixin(FeaturePreprocessingMixin):
@staticmethod
def type():
return WIDGET
@staticmethod
def cast_column(column, backend):
"""Cast the raw DataFrame column to the expected dtype."""
return column
@staticmethod
def add_feature_data(
feature_config,
input_df,
proc_df,
metadata,
preprocessing_parameters,
backend,
skip_save_processed_input,
):
"""Populate proc_df[feature_config[PROC_COLUMN]] with preprocessed values."""
proc_df[feature_config[PROC_COLUMN]] = input_df[feature_config[COLUMN]].values
return proc_df
@staticmethod
def fill_missing_values(feature_config, input_df, backend):
"""Replace NaN/None with a fill value appropriate for this type."""
return input_df
@staticmethod
def feature_meta(column, preprocessing_parameters, backend):
"""Compute and return the training-set-level metadata dict for this feature."""
return {}
@staticmethod
def get_preprocessing_module(feature_config, metadata):
"""Return the _WidgetPreprocessing module for use during inference."""
return _WidgetPreprocessing(metadata, feature_config.preprocessing)
InputFeature class¶
from ludwig.schema.features.widget_feature import WidgetInputFeatureConfig
class WidgetInputFeature(WidgetFeatureMixin, InputFeature):
def __init__(self, input_feature_config: WidgetInputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, **kwargs)
self._input_shape = torch.Size([1]) # set to actual encoded shape
if encoder_obj:
self.encoder_obj = encoder_obj
else:
self.encoder_obj = self.initialize_encoder(input_feature_config.encoder)
def forward(self, inputs, mask=None):
assert inputs.dtype == torch.float32
encoder_output = self.encoder_obj(inputs, mask=mask)
return {"encoder_output": encoder_output}
@property
def input_dtype(self):
return torch.float32
@property
def input_shape(self):
return self._input_shape
@property
def output_shape(self):
return self.encoder_obj.output_shape
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
pass
@staticmethod
def create_sample_input(batch_size=2):
return torch.zeros(batch_size, 1)
@staticmethod
def get_schema_cls():
return WidgetInputFeatureConfig
OutputFeature class (only if this type can be a target)¶
from ludwig.schema.features.widget_feature import WidgetOutputFeatureConfig
class WidgetOutputFeature(WidgetFeatureMixin, OutputFeature):
def __init__(self, output_feature_config: WidgetOutputFeatureConfig, output_features: dict, **kwargs):
super().__init__(output_feature_config, output_features, **kwargs)
self._input_shape = torch.Size([output_feature_config.input_size])
self.decoder_obj = self.initialize_decoder(output_feature_config.decoder)
self._setup_loss()
self._setup_metrics()
def logits(self, inputs, target=None):
return self.decoder_obj(inputs)
def create_predict_module(self):
return _WidgetPredict() # see PredictModule below
def get_prediction_set(self):
return {LOGITS, PREDICTIONS, PROBABILITIES}
@classmethod
def update_config_with_metadata(cls, feature_config, feature_metadata, *args, **kwargs):
feature_config.input_size = feature_metadata["input_size"]
@staticmethod
def get_schema_cls():
return WidgetOutputFeatureConfig
PredictModule (for output features)¶
from ludwig.features.base_feature import PredictModule
class _WidgetPredict(PredictModule):
def forward(self, inputs, feature_name):
logits = inputs[f"{feature_name}_{LOGITS}"]
predictions = (logits > 0.5).float()
return {PREDICTIONS: predictions, LOGITS: logits}
Step 5 — Register in the feature registries¶
Open ludwig/features/feature_registries.py and add your classes to all relevant registry functions:
# at the top — add import
from ludwig.features.widget_feature import WidgetFeatureMixin, WidgetInputFeature
# in get_base_type_registry(), inside the returned dict:
# WIDGET: WidgetFeatureMixin,
#
# in get_input_type_registry(), inside the returned dict:
# WIDGET: WidgetInputFeature,
#
# in get_output_type_registry() if applicable, inside the returned dict:
# WIDGET: WidgetOutputFeature,
The model builder uses get_input_type_registry() and get_output_type_registry() to instantiate feature objects from config at training time.
Step 6 — Register the constant in constants.py (feature sets)¶
If the feature appears in FEATURE_TYPES, INPUT_FEATURE_TYPES, or similar sets, add WIDGET there too.
Step 7 — Write tests¶
Create tests/ludwig/features/test_widget_feature.py. At minimum test:
WidgetFeatureMixin.add_feature_data— correct column values written toproc_df_WidgetPreprocessing.forward— correct tensor shape for a known inputWidgetInputFeature.forward— correct output keys and shapes with a random input- Encoder round-trip via
create_sample_input
import torch
import pytest
from tests.integration_tests.utils import generate_data, run_api_test
def test_widget_preprocessing_forward():
meta = {}
module = _WidgetPreprocessing(meta, preprocessing_config=None)
out = module(torch.zeros(4))
assert out.shape == (4, 1)
Checklist¶
- [ ]
ludwig/constants.py— addWIDGET = "widget" - [ ]
ludwig/schema/features/widget_feature.py— schema classes + registry decorators - [ ]
ludwig/schema/features/preprocessing/— preprocessing config class (or reuse existing) - [ ]
ludwig/features/widget_feature.py— preprocessing module, mixin, input/output feature classes - [ ]
ludwig/features/feature_registries.py— add toget_base_type_registry,get_input_type_registry, optionallyget_output_type_registry - [ ]
tests/ludwig/features/test_widget_feature.py— unit tests for preprocessing and forward pass
Common pitfalls¶
proc_df[PROC_COLUMN] vs proc_df[COLUMN] — always write to PROC_COLUMN (the internal column name), not COLUMN (the raw user column name). They can differ when the user renames features.
get_preprocessing_module vs add_feature_data — add_feature_data runs in Python at dataset preparation time (CPU, pandas). get_preprocessing_module returns a torch.nn.Module that runs inside the model graph at inference time. Both must produce compatible representations.
input_shape vs output_shape — InputFeature.input_shape is the shape of the raw preprocessed tensor going into the encoder. InputFeature.output_shape is the encoder's output shape that feeds into the combiner. Return self.encoder_obj.output_shape for the latter.
Registry order matters — the registry in feature_registries.py is read at import time. If you import your feature class before feature_registries.py is loaded, the registry will be empty. The correct order is always: define constants → define schema → define feature → add to registry.
Schema type field — always use schema_utils.ProtectedString(WIDGET) not str = WIDGET. The protected string raises an error if a user tries to override it in their config YAML, which prevents subtle type mismatches.