AutoML

Ludwig AutoML takes a dataset, the target column, and a time budget, and returns a trained Ludwig model.

Ludwig AutoML is currently experimental and focuses on tabular datasets. A blog post describing its development, evaluation, and use is here.

Ludwig AutoML infers the types of the input and output features, chooses the model architecture, and launches a Ray Tune Async HyperBand search job across a set of hyperparameters and ranges, limited by the specified time budget. It returns the set of models produced by the trials in the search sorted from best to worst, along with a hyperparameter search report, which can be inspected manually or post-processed by various Ludwig visualization tools.
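To give a feel for the kind of search described above, here is a minimal sketch of successive halving, the idea that HyperBand-style schedulers build on. It is a simplified, synchronous illustration in plain Python, not Ray Tune's asynchronous implementation and not Ludwig's code; the `successive_halving` and `toy_loss` names, the budget values, and the candidate learning rates are all assumptions made for the example.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3, max_budget=27):
    """Return (best_config, history) using a synchronous successive-halving loop.

    evaluate(config, budget) must return a loss (lower is better).
    """
    survivors = list(configs)
    if not survivors:
        raise ValueError("configs must not be empty")
    budget = min_budget
    history = []  # one (config, budget, loss) tuple per evaluation
    while True:
        # Score every surviving config at the current budget, best first.
        scored = sorted(
            ((evaluate(c, budget), c) for c in survivors),
            key=lambda pair: pair[0],
        )
        history.extend((c, budget, loss) for loss, c in scored)
        if len(scored) == 1 or budget >= max_budget:
            return scored[0][1], history
        # Keep the best 1/eta of the configs and give them eta times more budget.
        keep = max(1, len(scored) // eta)
        survivors = [c for _, c in scored[:keep]]
        budget *= eta


def toy_loss(config, budget):
    # Hypothetical stand-in for "train for `budget` epochs, report validation loss".
    return abs(config["learning_rate"] - 0.01) + 1.0 / budget


# Nine made-up candidates spanning several orders of magnitude.
candidates = [{"learning_rate": 10 ** -(i / 2)} for i in range(1, 10)]
best, history = successive_halving(candidates, toy_loss)
print(best, len(history))
```

Most candidates are evaluated only at the smallest budget, so the bulk of the compute goes to the few configurations that look promising early. That is why this family of schedulers suits a fixed time budget.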

Users can audit and interact with Ludwig AutoML in various ways, described below.

auto_train

The basic API for Ludwig AutoML is auto_train. A simple example of its invocation:

import pprint

from ludwig.automl import auto_train
from ludwig.datasets import mushroom_edibility
from ludwig.utils.dataset_utils import get_repeatable_train_val_test_split

# Load the dataset and add a reproducible 'split' column, stratified on the target.
mushroom_df = mushroom_edibility.load()
mushroom_edibility_df = get_repeatable_train_val_test_split(mushroom_df, 'class', random_seed=42)

# Run AutoML for at most two hours, reusing the fixed split created above.
auto_train_results = auto_train(
    dataset=mushroom_edibility_df,
    target='class',
    time_limit_s=7200,
    tune_for_memory=False,
    user_config={'preprocessing': {'split': {'column': 'split', 'type': 'fixed'}}},
)

pprint.pprint(auto_train_results)

create_auto_config

The Ludwig AutoML create_auto_config API outputs auto_train’s hyperparameter search configuration without running the search. This API is useful for examining AutoML's chosen input and output feature types, model architecture, and hyperparameters and ranges. A simple example of its invocation:

import pprint

from ludwig.automl import create_auto_config
from ludwig.datasets import mushroom_edibility
from ludwig.utils.dataset_utils import get_repeatable_train_val_test_split

# Load the dataset and add a reproducible 'split' column, stratified on the target.
mushroom_df = mushroom_edibility.load()
mushroom_edibility_df = get_repeatable_train_val_test_split(mushroom_df, 'class', random_seed=42)

# Build the hyperparameter search configuration without launching the search.
auto_config = create_auto_config(
    dataset=mushroom_edibility_df,
    target='class',
    time_limit_s=7200,
    tune_for_memory=False,
    user_config={'preprocessing': {'split': {'column': 'split', 'type': 'fixed'}}},
)

pprint.pprint(auto_config)

The API is also useful for manual refinement of the AutoML-generated search; the output of this API can be edited and then directly used as the input configuration for a Ludwig hyperparameter search job.
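As a rough illustration of that manual refinement, the sketch below narrows one search range in a configuration dictionary before it would be handed to a hyperparameter search. The `auto_config` dictionary here is a hand-written stand-in, not real create_auto_config output, and `narrow_range` is a hypothetical helper. The actual keys and ranges depend on your dataset and Ludwig version, so inspect the real output before editing it.

```python
import copy

# Hypothetical stand-in for the dictionary returned by create_auto_config;
# the features, parameter names, and ranges are illustrative only.
auto_config = {
    "input_features": [{"name": "odor", "type": "category"}],
    "output_features": [{"name": "class", "type": "category"}],
    "hyperopt": {
        "goal": "minimize",
        "metric": "loss",
        "parameters": {
            "trainer.learning_rate": {"space": "loguniform", "lower": 1e-5, "upper": 1e-1},
        },
    },
}


def narrow_range(config, parameter, lower, upper):
    """Return a copy of config with new numeric bounds for one search parameter."""
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    edited = copy.deepcopy(config)  # leave the generated config untouched
    space = edited["hyperopt"]["parameters"][parameter]
    space["lower"] = lower
    space["upper"] = upper
    return edited


refined_config = narrow_range(auto_config, "trainer.learning_rate", 1e-4, 1e-2)
print(refined_config["hyperopt"]["parameters"]["trainer.learning_rate"])
```

Working on a deep copy keeps the generated configuration available for comparison, which helps when auditing what was changed by hand.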

Overriding auto configs with user_config

The user_config parameter can be provided to the auto_train or create_auto_config APIs to override specified parts of the configuration produced.

For example, we can specify that the TripType output feature of the Walmart Recruiting dataset be set to type category, overriding the Ludwig AutoML type detection system’s characterization of the feature as a number feature.

import pprint

from ludwig.automl import auto_train
from ludwig.datasets import walmart_recruiting
from ludwig.utils.dataset_utils import get_repeatable_train_val_test_split

# Load the dataset and add a reproducible 'split' column, stratified on the target.
walmart_df = walmart_recruiting.load()
walmart_recruiting_df = get_repeatable_train_val_test_split(walmart_df, 'TripType', random_seed=42)

# Force TripType to be treated as a category and reuse the fixed split.
auto_train_results = auto_train(
    dataset=walmart_recruiting_df,
    target='TripType',
    time_limit_s=3600,
    tune_for_memory=False,
    user_config={
        'output_features': [{'column': 'TripType', 'name': 'TripType', 'type': 'category'}],
        'preprocessing': {'split': {'column': 'split', 'type': 'fixed'}},
    },
)

pprint.pprint(auto_train_results)
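One way to picture how a user_config override layers on top of a generated configuration is a recursive dictionary merge. The sketch below is an assumption about the general idea, not a copy of Ludwig's merge logic: `merge_overrides` is a hypothetical helper, the `generated` fragment is hand-written, and the choice to replace lists wholesale is a simplification that may not match how Ludwig handles feature lists.

```python
import copy


def merge_overrides(base, overrides):
    """Return a new dict in which values from overrides replace those in base.

    Nested dictionaries are merged key by key; any other value (including a
    list such as output_features) replaces the base value wholesale.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical auto-generated config fragment (illustrative values only).
generated = {
    "output_features": [{"column": "TripType", "name": "TripType", "type": "number"}],
    "hyperopt": {"goal": "minimize", "metric": "loss"},
}
user_config = {
    "output_features": [{"column": "TripType", "name": "TripType", "type": "category"}],
    "preprocessing": {"split": {"column": "split", "type": "fixed"}},
}
final_config = merge_overrides(generated, user_config)
print(final_config)
```

In this sketch, keys the user does not mention (here, the hyperopt section) are left as generated, which matches the stated intent of overriding only the specified parts.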

We can also specify that the hyperparameter search job optimize for maximum accuracy of the specified output feature rather than minimal loss of all combined output features, which is the default.

import pprint

from ludwig.automl import auto_train
from ludwig.datasets import mushroom_edibility
from ludwig.utils.dataset_utils import get_repeatable_train_val_test_split

# Load the dataset and add a reproducible 'split' column, stratified on the target.
mushroom_df = mushroom_edibility.load()
mushroom_edibility_df = get_repeatable_train_val_test_split(mushroom_df, 'class', random_seed=42)

# Optimize the search for accuracy on 'class' instead of the default combined loss.
auto_train_results = auto_train(
    dataset=mushroom_edibility_df,
    target='class',
    time_limit_s=3600,
    tune_for_memory=False,
    user_config={
        'hyperopt': {'goal': 'maximize', 'metric': 'accuracy', 'output_feature': 'class'},
        'preprocessing': {'split': {'column': 'split', 'type': 'fixed'}},
    },
)

pprint.pprint(auto_train_results)
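The choice of metric and goal matters because it decides which trial counts as best. The small sketch below ranks the same set of made-up trial summaries under both settings; `rank_trials` is a hypothetical helper and the numbers are invented for illustration, not taken from a real search.

```python
def rank_trials(trials, metric, goal):
    """Sort trial result dicts from best to worst for the given metric and goal."""
    if goal not in ("minimize", "maximize"):
        raise ValueError(f"unknown goal: {goal!r}")
    # For 'maximize' the largest value is best, so sort in descending order.
    return sorted(trials, key=lambda t: t[metric], reverse=(goal == "maximize"))


# Hypothetical trial summaries; the numbers are made up for illustration.
trials = [
    {"trial": "a", "loss": 0.30, "accuracy": 0.91},
    {"trial": "b", "loss": 0.25, "accuracy": 0.89},
    {"trial": "c", "loss": 0.40, "accuracy": 0.93},
]
by_loss = [t["trial"] for t in rank_trials(trials, "loss", "minimize")]
by_accuracy = [t["trial"] for t in rank_trials(trials, "accuracy", "maximize")]
print(by_loss, by_accuracy)
```

Here the trial with the lowest loss is not the one with the highest accuracy, so the two settings would report different best models.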
