LudwigModel

LudwigModel class¶

ludwig.api.LudwigModel(
  config,
  logging_level=40,
  backend=None,
  gpus=None,
  gpu_memory_limit=None,
  allow_parallel_threads=True,
  callbacks=None
)

Class that allows access to high level Ludwig functionalities.

Inputs

config (Union[str, dict]): in-memory representation of the config, or string path to a YAML config file.
logging_level (int, default: 40): log level that will be sent to stderr.
backend (Union[Backend, str], default: None): Backend or string name of backend to use to execute preprocessing / training steps.
gpus (Union[str, int, List[int]], default: None): GPUs to use (same syntax as CUDA_VISIBLE_DEVICES).
gpu_memory_limit (float, default: None): maximum memory fraction [0, 1] allowed to allocate per GPU device.
allow_parallel_threads (bool, default: True): allow Torch to use multithreading parallelism to improve performance at the cost of determinism.
callbacks (list, default: None): a list of ludwig.callbacks.Callback objects that provide hooks into the Ludwig pipeline.
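The three accepted types for gpus and the numeric logging_level can be illustrated with a small sketch. The helper gpus_to_visible_devices below is hypothetical (it is not part of Ludwig); it only shows how a string, an integer, or a list of integers could each be written in CUDA_VISIBLE_DEVICES syntax.

```python
import logging
from typing import List, Optional, Union


def gpus_to_visible_devices(gpus: Optional[Union[str, int, List[int]]]) -> Optional[str]:
    """Hypothetical helper: render a `gpus` value as a CUDA_VISIBLE_DEVICES-style string."""
    if gpus is None:
        return None  # no restriction requested
    if isinstance(gpus, int):
        return str(gpus)  # a single device index, e.g. 0 -> "0"
    if isinstance(gpus, str):
        return gpus.replace(" ", "")  # already comma-separated, e.g. "0, 1" -> "0,1"
    return ",".join(str(g) for g in gpus)  # list of indices, e.g. [0, 1] -> "0,1"


# The default logging_level of 40 is the numeric value of logging.ERROR.
default_level_name = logging.getLevelName(40)
```

Passing logging.ERROR instead of the literal 40 is equivalent and usually easier to read.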

Example usage:

from ludwig.api import LudwigModel

Train a model:

config = {...}
ludwig_model = LudwigModel(config)
train_stats, _, _ = ludwig_model.train(dataset=file_path)

or

train_stats, _, _ = ludwig_model.train(dataset=dataframe)

If you have already trained a model, you can load it and use it to predict:

ludwig_model = LudwigModel.load(model_dir)

Predict:

predictions, _ = ludwig_model.predict(dataset=file_path)

or

predictions, _ = ludwig_model.predict(dataset=dataframe)

Evaluation:

eval_stats, _, _ = ludwig_model.evaluate(dataset=file_path)

or

eval_stats, _, _ = ludwig_model.evaluate(dataset=dataframe)

PublicAPI: This API is stable across Ludwig releases.

LudwigModel methods¶

collect_activations¶

collect_activations(
  layer_names,
  dataset,
  data_format=None,
  split='full',
  batch_size=128
)

Loads a pre-trained model and input data to collect the values of the activations contained in the tensors.

Inputs

layer_names (list): list of strings for layer names in the model to collect activations.
dataset (Union[str, Dict[str, list], pandas.DataFrame]): source containing the data to make predictions.
data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
split (str, default: 'full'): if the input dataset contains a split column, this parameter indicates which split of the data to use. Possible values are 'full', 'training', 'validation', 'test'.
batch_size (int, default: 128): size of batch to use when making predictions.

Return

return (list): list of collected tensors.
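When data_format is left as None, the format is inferred from the dataset argument. The sketch below is only an illustration of what such inference could look like; guess_data_format and the extension table are hypothetical and do not reproduce Ludwig's actual logic. The format names are taken from the list of valid formats above.

```python
from pathlib import Path

# Hypothetical extension-to-format table; format names come from the documented list.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".parquet": "parquet",
    ".feather": "feather",
    ".hdf5": "hdf5",
    ".html": "html",
}


def guess_data_format(dataset) -> str:
    """Guess a data_format string from the type or file extension of `dataset`."""
    if isinstance(dataset, dict):
        return "dict"  # in-memory dictionary of columns
    if isinstance(dataset, str):
        suffix = Path(dataset).suffix.lower()
        if suffix in EXTENSION_TO_FORMAT:
            return EXTENSION_TO_FORMAT[suffix]
        raise ValueError(f"cannot infer a format from extension {suffix!r}")
    if type(dataset).__name__ == "DataFrame":
        return "df"  # checked by class name to avoid importing pandas here
    raise ValueError(f"unsupported dataset type: {type(dataset).__name__}")
```

If inference could be ambiguous, passing data_format explicitly avoids relying on it.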

collect_weights¶

collect_weights(
  tensor_names=None
)

Load a pre-trained model and collect the tensors with a specific name.

Inputs

tensor_names (list, default: None): List of tensor names to collect weights

Return

return (list): List of tensors

create_model¶

create_model(
  config_obj,
  random_seed=42
)

Instantiates BaseModel object.

Inputs

config_obj (Union[Config, dict]): Ludwig config object
random_seed (int, default: 42): Random seed used for weights initialization, splits and any other random function.

Return

return (ludwig.models.BaseModel): Instance of the Ludwig model object.

evaluate¶

evaluate(
  dataset=None,
  data_format=None,
  split='full',
  batch_size=None,
  skip_save_unprocessed_output=True,
  skip_save_predictions=True,
  skip_save_eval_stats=True,
  collect_predictions=False,
  collect_overall_stats=False,
  output_directory='results',
  return_type=pandas.DataFrame
)

This function is used to predict the output variables given the input variables using the trained model and compute test statistics like performance measures, confusion matrices and the like.

Inputs

dataset (Union[str, dict, pandas.DataFrame]): source containing the entire dataset to be evaluated.
data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
split (str, default: 'full'): if the input dataset contains a split column, this parameter indicates which split of the data to use. Possible values are 'full', 'training', 'validation', 'test'.
batch_size (int, default: None): size of batch to use when making predictions. Defaults to model config eval_batch_size
skip_save_unprocessed_output (bool, default: True): if this parameter is False, predictions and their probabilities are saved in both raw unprocessed numpy files containing tensors and as postprocessed CSV files (one for each output feature). If this parameter is True, only the CSV ones are saved and the numpy ones are skipped.
skip_save_predictions (bool, default: True): skips saving test predictions CSV files.
skip_save_eval_stats (bool, default: True): skips saving test statistics JSON file.
collect_predictions (bool, default: False): if True collects post-processed predictions during eval.
collect_overall_stats (bool, default: False): if True collects overall stats during eval.
output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
return_type (Union[str, dict, pd.DataFrame], default: pandas.DataFrame): indicates the format of the returned predictions.

Return

return (evaluation_statistics, predictions, output_directory): evaluation_statistics is a dictionary containing evaluation performance statistics, predictions contains the postprocessed predicted values, and output_directory is the location where results are stored.

experiment¶

experiment(
  dataset=None,
  training_set=None,
  validation_set=None,
  test_set=None,
  training_set_metadata=None,
  data_format=None,
  experiment_name='experiment',
  model_name='run',
  model_resume_path=None,
  eval_split='test',
  skip_save_training_description=False,
  skip_save_training_statistics=False,
  skip_save_model=False,
  skip_save_progress=False,
  skip_save_log=False,
  skip_save_processed_input=False,
  skip_save_unprocessed_output=False,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  skip_collect_predictions=False,
  skip_collect_overall_stats=False,
  output_directory='results',
  random_seed=42
)

Trains a model on a dataset's training and validation splits and uses it to predict on the test split. It saves the trained model and the statistics of training and testing.

Inputs

dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
training_set (Union[str, dict, pandas.DataFrame], default: None): source containing training data.
validation_set (Union[str, dict, pandas.DataFrame], default: None): source containing validation data.
test_set (Union[str, dict, pandas.DataFrame], default: None): source containing test data.
training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
experiment_name (str, default: 'experiment'): name for the experiment.
model_name (str, default: 'run'): name of the model that is being used.
model_resume_path (str, default: None): resumes training of the model from the path specified. The config is restored. In addition to config, training statistics and loss for epoch and the state of the optimizer are restored such that training can be effectively continued from a previously interrupted training process.
eval_split (str, default: test): split on which to perform evaluation. Valid values are training, validation and test.
skip_save_training_description (bool, default: False): disables saving the description JSON file.
skip_save_training_statistics (bool, default: False): disables saving training statistics JSON file.
skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch the validation metric improves, but if the model is really big that can be time consuming. If you do not want to keep the weights and just find out what performance a model can get with a set of hyperparameters, use this parameter to skip it, but the model will not be loadable later on and the returned model will have the weights obtained at the end of training, instead of the weights of the epoch with the best validation performance.
skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch to enable resuming training, but for a very large model this can be time consuming and uses twice as much space. Use this parameter to skip it, but note that training cannot be resumed later on.
skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for the TensorBoard, but if it is not needed turning it off can slightly increase the overall speed.
skip_save_processed_input (bool, default: False): if an input dataset is provided, it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
skip_save_unprocessed_output (bool, default: False): by default predictions and their probabilities are saved in both raw unprocessed numpy files containing tensors and as postprocessed CSV files (one for each output feature). If this parameter is True, only the CSV ones are saved and the numpy ones are skipped.
skip_save_predictions (bool, default: False): skips saving test predictions CSV files
skip_save_eval_stats (bool, default: False): skips saving test statistics JSON file
skip_collect_predictions (bool, default: False): skips collecting post-processed predictions during eval.
skip_collect_overall_stats (bool, default: False): skips collecting overall stats during eval.
output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
random_seed (int, default: 42): random seed used for weights initialization, splits and any other random function.

Return

return (Tuple[dict, dict, tuple, str]): (evaluation_statistics, training_statistics, preprocessed_data, output_directory). evaluation_statistics is a dictionary with evaluation performance statistics on the test_set. training_statistics is a nested dictionary of dataset -> feature_name -> metric_name -> List of metrics, where each metric corresponds to a training checkpoint. preprocessed_data is a tuple containing the preprocessed (training_set, validation_set, test_set). output_directory is the filepath string to where results are stored.
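The nested layout of training_statistics (dataset -> feature_name -> metric_name -> list of values, one per checkpoint) can be walked with plain dictionary indexing. The sketch below uses mock data; the keys "training", "validation", "label", "loss" and "accuracy" are made up for illustration, and best_checkpoint is a hypothetical helper rather than a Ludwig function.

```python
# Mock data shaped like: dataset -> feature_name -> metric_name -> list of values.
training_statistics = {
    "training": {"label": {"loss": [0.9, 0.6, 0.4], "accuracy": [0.55, 0.70, 0.82]}},
    "validation": {"label": {"loss": [1.0, 0.7, 0.75], "accuracy": [0.50, 0.68, 0.66]}},
}


def best_checkpoint(stats, dataset, feature, metric, higher_is_better=True):
    """Return (checkpoint_index, value) of the best recorded value for one metric."""
    values = stats[dataset][feature][metric]
    pick = max if higher_is_better else min
    # Select the index whose value is best; ties resolve to the earliest checkpoint.
    index = pick(range(len(values)), key=values.__getitem__)
    return index, values[index]


best_acc = best_checkpoint(training_statistics, "validation", "label", "accuracy")
best_loss = best_checkpoint(training_statistics, "validation", "label", "loss", higher_is_better=False)
```

With the mock values above, both calls select the second checkpoint (index 1).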

forecast¶

forecast(
  dataset,
  data_format=None,
  horizon=1,
  output_directory=None,
  output_format='parquet'
)

free_gpu_memory¶

free_gpu_memory(
)

Manually moves the model to CPU to force GPU memory to be freed.

For more context: https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530/35

generate¶

generate(
  input_strings,
  generation_config=None,
  streaming=False
)

A simple generate() method that directly uses the underlying transformers library to generate text.

Args:

input_strings (Union[str, List[str]]): input text or list of texts to generate from.
generation_config (Optional[dict]): configuration for text generation.
streaming (Optional[bool]): if True, enable streaming output.

Returns: Union[str, List[str]]: Generated text or list of generated texts.
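The Union[str, List[str]] input and return types suggest that a single string yields a single generated text and a list yields a list. The sketch below illustrates that dispatch pattern under this assumption; generate_like and echo_upper are hypothetical stand-ins and do not call the transformers library.

```python
from typing import Callable, List, Union


def generate_like(
    input_strings: Union[str, List[str]],
    generate_one: Callable[[str], str],
) -> Union[str, List[str]]:
    """Apply `generate_one` to a single prompt or to each prompt in a list."""
    if isinstance(input_strings, str):
        return generate_one(input_strings)  # single prompt in, single text out
    return [generate_one(prompt) for prompt in input_strings]  # list in, list out


def echo_upper(prompt: str) -> str:
    """Stand-in for a real text generator: returns the prompt in upper case."""
    return prompt.upper()
```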

is_merge_and_unload_set¶

is_merge_and_unload_set(
)

Check whether the encapsulated model is of type LLM and is configured to merge_and_unload QLoRA weights.

Return

return (bool): whether merge_and_unload should be done.

load¶

load(
  model_dir,
  logging_level=40,
  backend=None,
  gpus=None,
  gpu_memory_limit=None,
  allow_parallel_threads=True,
  callbacks=None,
  from_checkpoint=False
)

This function allows for loading pretrained models.

Inputs

model_dir (str): path to the directory containing the model. If the model was trained by the train or experiment command, the model is in results_dir/experiment_dir/model.
logging_level (int, default: 40): log level that will be sent to stderr.
backend (Union[Backend, str]): Backend or string name of backend to use to execute preprocessing / training steps.
gpus (Union[str, int, List[int]], default: None): GPUs to use (it uses the same syntax of CUDA_VISIBLE_DEVICES)
gpu_memory_limit (float, default: None): maximum memory fraction [0, 1] allowed to allocate per GPU device.
allow_parallel_threads (bool, default: True): allow Torch to use multithreading parallelism to improve performance at the cost of determinism.
callbacks (list, default: None): a list of ludwig.callbacks.Callback objects that provide hooks into the Ludwig pipeline.
from_checkpoint (bool, default: False): if True, the model will be loaded from the latest checkpoint (training_checkpoints/) instead of the final model weights.

Return

return (LudwigModel): a LudwigModel object

Example usage

ludwig_model = LudwigModel.load(model_dir)
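Since a model trained with train or experiment is stored in a model subdirectory of the run's output directory, the model_dir to pass to load can be derived from the output_directory value that those methods return. A minimal sketch, where model_dir_from_output is a hypothetical helper and the example path is made up:

```python
from pathlib import Path


def model_dir_from_output(output_directory: str) -> str:
    """Append the 'model' subdirectory to a run's output directory."""
    return str(Path(output_directory) / "model")


# Example with a made-up output directory.
model_dir = model_dir_from_output("results/api_experiment_run")
```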

load_weights¶

load_weights(
  model_dir,
  from_checkpoint=False
)

Loads weights from a pre-trained model.

Inputs

model_dir (str): filepath string to location of a pre-trained model
from_checkpoint (bool, default: False): if True, the model will be loaded from the latest checkpoint (training_checkpoints/) instead of the final model weights.

Return

return (None): None

Example usage

ludwig_model.load_weights(model_dir)

predict¶

predict(
  dataset=None,
  data_format=None,
  split='full',
  batch_size=128,
  generation_config=None,
  skip_save_unprocessed_output=True,
  skip_save_predictions=True,
  output_directory='results',
  return_type=pandas.DataFrame,
  callbacks=None
)

Using a trained model, make predictions from the provided dataset.

Inputs

dataset (Union[str, dict, pandas.DataFrame]): source containing the entire dataset to be evaluated.
data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
split (str, default: 'full'): if the input dataset contains a split column, this parameter indicates which split of the data to use. Possible values are 'full', 'training', 'validation', 'test'.
batch_size (int, default: 128): size of batch to use when making predictions.
generation_config (Dict, default: None): config for the generation of the predictions. If None, the config that was used during model training is used. This is only used if the model type is LLM. Otherwise, this parameter is ignored. See Large Language Models under "Generation" for an example generation config.
skip_save_unprocessed_output (bool, default: True): if this parameter is False, predictions and their probabilities are saved in both raw unprocessed numpy files containing tensors and as postprocessed CSV files (one for each output feature). If this parameter is True, only the CSV ones are saved and the numpy ones are skipped.
skip_save_predictions (bool, default: True): skips saving test predictions CSV files.
output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
return_type (Union[str, dict, pandas.DataFrame], default: pd.DataFrame): indicates the format of the returned predictions.
callbacks (Optional[List[Callback]], default: None): optional list of callbacks to use during this predict operation. Any callbacks already registered to the model will be preserved.

Return

return (Tuple[Union[dict, pd.DataFrame], str]): (predictions, output_directory). predictions contains the predictions from the provided dataset, and output_directory is the filepath string to where data was stored.

preprocess¶

preprocess(
  dataset=None,
  training_set=None,
  validation_set=None,
  test_set=None,
  training_set_metadata=None,
  data_format=None,
  skip_save_processed_input=True,
  random_seed=42
)

This function is used to preprocess data.

Args:

dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
training_set (Union[str, dict, pandas.DataFrame], default: None): source containing training data.
validation_set (Union[str, dict, pandas.DataFrame], default: None): source containing validation data.
test_set (Union[str, dict, pandas.DataFrame], default: None): source containing test data.
training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
skip_save_processed_input (bool, default: True): if an input dataset is provided, it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
random_seed (int, default: 42): a random seed that will be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling

Returns:

return (PreprocessedDataset): data structure containing (proc_training_set, proc_validation_set, proc_test_set, training_set_metadata).

Raises:

RuntimeError: An error occurred while preprocessing the data. Examples include the training dataset being empty after preprocessing, lazy loading not being supported with RayBackend, etc.
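The split-column convention described for dataset (0 for train, 1 for validation, 2 for test) can be sketched in plain Python. This is an illustration only, not Ludwig's preprocessing code; the column name "split" and the helper partition_by_split are assumptions.

```python
# Mapping taken from the dataset description: 0 = train, 1 = validation, 2 = test.
SPLIT_NAMES = {0: "training", 1: "validation", 2: "test"}


def partition_by_split(rows, split_column="split"):
    """Group row dictionaries by the integer value of their split column."""
    parts = {name: [] for name in SPLIT_NAMES.values()}
    for row in rows:
        parts[SPLIT_NAMES[int(row[split_column])]].append(row)
    return parts


rows = [
    {"text": "a", "split": 0},
    {"text": "b", "split": 0},
    {"text": "c", "split": 1},
    {"text": "d", "split": 2},
]
parts = partition_by_split(rows)
```

A dataset without such a column is split randomly instead, as noted in the dataset description.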

save¶

save(
  save_path
)

This function saves the model to disk.

Inputs

save_path (str): path to the directory where the model is going to be saved. Both a JSON file containing the model architecture hyperparameters and checkpoint files containing model weights will be saved.

Return

return (None): None

Example usage

ludwig_model.save(save_path)

save_config¶

save_config(
  save_path
)

Save config to specified location.

Inputs

save_path (str): filepath string to save config as a JSON file.

Return

return (None): None

save_dequantized_base_model¶

save_dequantized_base_model(
  save_path
)

Upscales quantized weights of a model to fp16 and saves the result in a specified folder.

Args: save_path (str): The path to the folder where the upscaled model weights will be saved.

Raises:

ValueError: if the model type is not 'llm', if quantization is not enabled, or if the number of bits is not 4 or 8.
RuntimeError: if no GPU is available, as a GPU is required for quantized models.

Returns: None
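The documented error conditions can be read as a sequence of precondition checks. The sketch below mirrors them on a plain dictionary; check_can_dequantize is a hypothetical helper, and the config keys "model_type", "quantization" and "bits" are assumptions about the config layout rather than something stated in this section.

```python
def check_can_dequantize(config: dict, gpu_available: bool) -> None:
    """Raise the documented errors if the preconditions for dequantizing are not met."""
    if config.get("model_type") != "llm":
        raise ValueError("model type must be 'llm'")
    quantization = config.get("quantization")
    if not quantization:
        raise ValueError("quantization is not enabled")
    if quantization.get("bits") not in (4, 8):
        raise ValueError("number of bits must be 4 or 8")
    if not gpu_available:
        raise RuntimeError("a GPU is required for quantized models")


# A config that passes every check (values are illustrative).
check_can_dequantize({"model_type": "llm", "quantization": {"bits": 4}}, gpu_available=True)
```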

save_torchscript¶

save_torchscript(
  save_path,
  model_only=False,
  device=None
)

Saves the Torchscript model to disk.

Inputs

save_path (str): The path to the directory where the model will be saved.
model_only (bool, optional): If True, only the ECD model will be converted to Torchscript. Else, the preprocessing and postprocessing steps will also be converted to Torchscript.
device (TorchDevice, optional): If None, the model will be converted to Torchscript on the same device to ensure maximum model parity.

Return

return (None): None

set_logging_level¶

set_logging_level(
  logging_level
)

Sets level for log messages.

Inputs

logging_level (int): Set/Update the logging level. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR.

Return

return (None): None
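The logging constants are plain integers from Python's standard logging module, which is why signatures in this reference show defaults such as 40 and 20. A short sketch of the correspondence, where level_from_name is a hypothetical convenience helper:

```python
import logging

# Standard-library logging constants and their numeric values.
LEVELS = {
    "debug": logging.DEBUG,      # 10
    "info": logging.INFO,        # 20
    "warning": logging.WARNING,  # 30
    "error": logging.ERROR,      # 40
}


def level_from_name(name: str) -> int:
    """Translate a level name such as 'info' into its integer logging constant."""
    return LEVELS[name.lower()]
```

For example, ludwig_model.set_logging_level(logging.INFO) is equivalent to passing 20.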

to_torchscript¶

to_torchscript(
  model_only=False,
  device=None
)

Converts the trained model to Torchscript.

Inputs

model_only (bool, optional): If True, only the ECD model will be converted to Torchscript. Else, preprocessing and postprocessing steps will also be converted to Torchscript.
device (TorchDevice, optional): If None, the model will be converted to Torchscript on the same device to ensure maximum model parity.

Returns

return (torch.jit.ScriptModule): a module that can be used to predict on a dictionary of inputs.

train¶

train(
  dataset=None,
  training_set=None,
  validation_set=None,
  test_set=None,
  training_set_metadata=None,
  data_format=None,
  experiment_name='api_experiment',
  model_name='run',
  model_resume_path=None,
  skip_save_training_description=False,
  skip_save_training_statistics=False,
  skip_save_model=False,
  skip_save_progress=False,
  skip_save_log=False,
  skip_save_processed_input=False,
  output_directory='results',
  random_seed=42
)

This function is used to perform a full training of the model on the specified dataset.

During training, if the skip parameters are False, the model and statistics will be saved in a directory [output_dir]/[experiment_name]_[model_name]_n, where all variables are resolved to the user-specified ones and n is an increasing number starting from 0, used to differentiate among repeated runs.
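The directory naming pattern can be sketched as follows. The helper next_run_dir is hypothetical and implements the pattern literally as written above; the library's actual handling, in particular for the first run, may differ.

```python
import os


def next_run_dir(output_directory: str, experiment_name: str, model_name: str) -> str:
    """Return the first [output_dir]/[experiment_name]_[model_name]_n path that does not exist."""
    n = 0
    while True:
        candidate = os.path.join(output_directory, f"{experiment_name}_{model_name}_{n}")
        if not os.path.exists(candidate):
            return candidate
        n += 1  # directory already taken by an earlier run, try the next suffix
```

Because the suffix only increases, repeated runs with the same experiment_name and model_name do not overwrite each other's results.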

Inputs

dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
training_set (Union[str, dict, pandas.DataFrame], default: None): source containing training data.
validation_set (Union[str, dict, pandas.DataFrame], default: None): source containing validation data.
test_set (Union[str, dict, pandas.DataFrame], default: None): source containing test data.
training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
experiment_name (str, default: 'api_experiment'): name for the experiment.
model_name (str, default: 'run'): name of the model that is being used.
model_resume_path (str, default: None): resumes training of the model from the path specified. The config is restored. In addition to config, training statistics, loss for each epoch and the state of the optimizer are restored such that training can be effectively continued from a previously interrupted training process.
skip_save_training_description (bool, default: False): disables saving the description JSON file.
skip_save_training_statistics (bool, default: False): disables saving training statistics JSON file.
skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch the validation metric improves, but if the model is really big that can be time consuming. If you do not want to keep the weights and just find out what performance a model can get with a set of hyperparameters, use this parameter to skip it, but the model will not be loadable later on and the returned model will have the weights obtained at the end of training, instead of the weights of the epoch with the best validation performance.
skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch to enable resuming training, but for a very large model this can be time consuming and uses twice as much space. Use this parameter to skip it, but note that training cannot be resumed later on.
skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for the TensorBoard, but if it is not needed turning it off can slightly increase the overall speed.
skip_save_processed_input (bool, default: False): if an input dataset is provided, it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
random_seed (int, default: 42): a random seed that will be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling
kwargs (dict, default: {}): a dictionary of optional parameters.

Return

return (Tuple[Dict, Union[Dict, pd.DataFrame], str]): tuple containing (training_statistics, preprocessed_data, output_directory). training_statistics is a nested dictionary of dataset -> feature_name -> metric_name -> List of metrics. Each metric corresponds to each training checkpoint. preprocessed_data is the tuple containing these three data sets (training_set, validation_set, test_set). output_directory filepath to where training results are stored.

train_online¶

train_online(
  dataset,
  training_set_metadata=None,
  data_format='auto',
  random_seed=42
)

Performs one epoch of training of the model on dataset.

Inputs

dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
data_format (str, default: 'auto'): format to interpret data sources. Will be inferred automatically if set to 'auto'. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
random_seed (int, default: 42): a random seed that is going to be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling

Return

return (None): None

upload_to_hf_hub¶

upload_to_hf_hub(
  repo_id,
  model_path,
  repo_type='model',
  private=False,
  commit_message='Upload trained [Ludwig](https://ludwig.ai/latest/) model weights',
  commit_description=None
)

Uploads trained model artifacts to the HuggingFace Hub.

Inputs

repo_id (str): A namespace (user or an organization) and a repo name separated by a /.
model_path (str): The path of the saved model. This is either (a) the folder where the 'model_weights' folder and the 'model_hyperparameters.json' file are stored, or (b) the parent of that folder.
private (bool, optional, defaults to False): Whether the model repo should be private.
repo_type (str, optional): Set to "dataset" or "space" if uploading to a dataset or space, None or "model" if uploading to a model. Default is 'model'.
commit_message (str, optional): The summary / title / first line of the generated commit. Defaults to 'Upload trained [Ludwig](https://ludwig.ai/latest/) model weights'.
commit_description (str, optional): The description of the generated commit.

Returns

return (bool): True for success, False for failure.
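The repo_id format (a namespace and a repo name separated by a /) can be checked before attempting an upload. A minimal sketch, where split_repo_id is a hypothetical helper and the example identifier is made up:

```python
from typing import Tuple


def split_repo_id(repo_id: str) -> Tuple[str, str]:
    """Split 'namespace/repo_name' into its two parts, rejecting malformed values."""
    namespace, separator, name = repo_id.partition("/")
    if not separator or not namespace or not name or "/" in name:
        raise ValueError(f"expected 'namespace/repo_name', got {repo_id!r}")
    return namespace, name


namespace, repo_name = split_repo_id("my-org/my-ludwig-model")
```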

Module functions¶

kfold_cross_validate¶

ludwig.api.kfold_cross_validate(
  num_folds,
  config,
  dataset=None,
  data_format=None,
  skip_save_training_description=False,
  skip_save_training_statistics=False,
  skip_save_model=False,
  skip_save_progress=False,
  skip_save_log=False,
  skip_save_processed_input=False,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  skip_collect_predictions=False,
  skip_collect_overall_stats=False,
  output_directory='results',
  random_seed=42,
  gpus=None,
  gpu_memory_limit=None,
  allow_parallel_threads=True,
  backend=None,
  logging_level=20
)

Performs k-fold cross validation and returns result data structures.

Inputs

num_folds (int): number of folds to create for the cross-validation
config (Union[dict, str]): model specification required to build a model. Parameter may be a dictionary or string specifying the file path to a YAML configuration file. Refer to the User Guide for details.
dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used for k_fold processing.
data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'. Currently hdf5 format is not supported for k_fold cross validation.
skip_save_training_description (bool, default: False): disables saving the description JSON file.
skip_save_training_statistics (bool, default: False): disables saving training statistics JSON file.
skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch the validation metric improves, but if the model is really big that can be time consuming. If you do not want to keep the weights and just find out what performance a model can get with a set of hyperparameters, use this parameter to skip it, but the model will not be loadable later on and the returned model will have the weights obtained at the end of training, instead of the weights of the epoch with the best validation performance.
skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch to enable resuming training, but for a very large model this can be time consuming and uses twice as much space. Use this parameter to skip it, but note that training cannot be resumed later on.
skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for the TensorBoard, but if it is not needed turning it off can slightly increase the overall speed.
skip_save_processed_input (bool, default: False): if an input dataset is provided, it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
skip_save_predictions (bool, default: False): skips saving test predictions CSV files.
skip_save_eval_stats (bool, default: False): skips saving test statistics JSON file.
skip_collect_predictions (bool, default: False): skips collecting post-processed predictions during eval.
skip_collect_overall_stats (bool, default: False): skips collecting overall stats during eval.
output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
random_seed (int, default: 42): Random seed used for weights initialization, splits and any other random function.
gpus (list, default: None): list of GPUs that are available for training.
gpu_memory_limit (float, default: None): maximum memory fraction [0, 1] allowed to allocate per GPU device.
allow_parallel_threads (bool, default: True): allow Torch to use multithreading parallelism to improve performance at the cost of determinism.
backend (Union[Backend, str]): Backend or string name of backend to use to execute preprocessing / training steps.
logging_level (int, default: INFO): log level to send to stderr.

Return

return (Tuple[dict, dict]): a tuple of two dictionaries, (kfold_cv_statistics, kfold_split_indices). kfold_cv_statistics contains the metrics from the cross-validation run. kfold_split_indices contains the indices used to split the training data into a training fold and a test fold.

PublicAPI: This API is stable across Ludwig releases.
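
As an illustration of how the inputs and the returned tuple fit together, here is a minimal sketch rather than a reference implementation. It assumes the function is importable as ludwig.api.kfold_cross_validate. The helper names build_config and run_kfold, the feature names review_text and sentiment, and the dataset path are hypothetical.

```python
from typing import Any, Dict, Tuple


def build_config() -> Dict[str, Any]:
    """Return a minimal config dict; the feature names are hypothetical placeholders."""
    return {
        "input_features": [{"name": "review_text", "type": "text"}],
        "output_features": [{"name": "sentiment", "type": "category"}],
    }


def run_kfold(dataset_path: str, num_folds: int = 5) -> Tuple[dict, dict]:
    """Run k-fold cross-validation and return (statistics, split indices)."""
    # Imported lazily so this module can be loaded without Ludwig installed.
    # Assumption: the function lives at ludwig.api.kfold_cross_validate.
    from ludwig.api import kfold_cross_validate

    kfold_cv_statistics, kfold_split_indices = kfold_cross_validate(
        num_folds=num_folds,
        config=build_config(),
        dataset=dataset_path,
        skip_save_processed_input=True,  # do not write the HDF5/JSON cache
        output_directory="results",
        random_seed=42,
    )
    return kfold_cv_statistics, kfold_split_indices
```

With Ludwig installed, calling run_kfold("reviews.csv") on a hypothetical CSV file containing those two columns would return the two dictionaries described under Return.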

hyperopt¶

ludwig.hyperopt.run.hyperopt(
  config,
  dataset=None,
  training_set=None,
  validation_set=None,
  test_set=None,
  training_set_metadata=None,
  data_format=None,
  experiment_name='hyperopt',
  model_name='run',
  resume=None,
  skip_save_training_description=False,
  skip_save_training_statistics=False,
  skip_save_model=False,
  skip_save_progress=False,
  skip_save_log=False,
  skip_save_processed_input=True,
  skip_save_unprocessed_output=False,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  skip_save_hyperopt_statistics=False,
  output_directory='results',
  gpus=None,
  gpu_memory_limit=None,
  allow_parallel_threads=True,
  callbacks=None,
  tune_callbacks=None,
  backend=None,
  random_seed=42,
  hyperopt_log_verbosity=3
)

This method performs a hyperparameter optimization.

Inputs

config (Union[str, dict]): config which defines the different parameters of the model, features, preprocessing and training. If str, filepath to yaml configuration file.
dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
training_set (Union[str, dict, pandas.DataFrame], default: None): source containing training data.
validation_set (Union[str, dict, pandas.DataFrame], default: None): source containing validation data.
test_set (Union[str, dict, pandas.DataFrame], default: None): source containing test data.
training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
experiment_name (str, default: 'hyperopt'): name for the experiment.
model_name (str, default: 'run'): name of the model that is being used.
resume (bool, default: None): If True, continue hyperopt from the state of the previous run in the output directory with the same experiment name. If False, create new trials, ignoring any previous state even if it exists in the output_directory. By default (None), attempt to resume if there is already an existing experiment with the same name, and create new trials if not.
skip_save_training_description (bool, default: False): disables saving the description JSON file.
skip_save_training_statistics (bool, default: False): disables saving training statistics JSON file.
skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch in which the validation metric improves, which can be time consuming for very large models. Set this parameter if you only want to find out what performance a set of hyperparameters can reach and do not need to keep the weights. In that case the model cannot be loaded later, and the returned model has the weights obtained at the end of training instead of the weights of the epoch with the best validation performance.
skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch so that training can be resumed, but for very large models this can be time consuming and uses twice as much space. Set this parameter to skip it; training then cannot be resumed later on.
skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for TensorBoard; if they are not needed, turning this off can slightly increase the overall speed.
skip_save_processed_input (bool, default: True): if an input dataset is provided, it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
skip_save_unprocessed_output (bool, default: False): by default predictions and their probabilities are saved in both raw unprocessed numpy files containing tensors and as postprocessed CSV files (one for each output feature). If this parameter is True, only the CSV ones are saved and the numpy ones are skipped.
skip_save_predictions (bool, default: False): skips saving test predictions CSV files.
skip_save_eval_stats (bool, default: False): skips saving test statistics JSON file.
skip_save_hyperopt_statistics (bool, default: False): skips saving hyperopt stats file.
output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
gpus (list, default: None): list of GPUs that are available for training.
gpu_memory_limit (float, default: None): maximum memory fraction [0, 1] allowed to allocate per GPU device.
allow_parallel_threads (bool, default: True): allow PyTorch to use multithreading parallelism to improve performance at the cost of determinism.
callbacks (list, default: None): a list of ludwig.callbacks.Callback objects that provide hooks into the Ludwig pipeline.
backend (Union[Backend, str]): Backend or string name of backend to use to execute preprocessing / training steps.
random_seed (int, default: 42): random seed used for weights initialization, splits and any other random function.
hyperopt_log_verbosity (int, default: 3): controls verbosity of Ray Tune log messages. Valid values: 0 = silent, 1 = only status updates, 2 = status and brief trial results, 3 = status and detailed trial results.

Return

return (List[dict]): List of results for each trial, ordered by descending performance on the target metric.
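
To make the call pattern concrete, the following is a minimal sketch, not a definitive recipe. The import path ludwig.hyperopt.run.hyperopt and the keyword arguments come from the signature above. Everything else is an assumption: the helper names build_hyperopt_config, run_hyperopt and best_trial, the feature names, the dataset path, and in particular the keys inside the hyperopt section of the config, whose layout differs between Ludwig releases and should be checked against the User Guide.

```python
from typing import Any, Dict, List


def build_hyperopt_config() -> Dict[str, Any]:
    """Config with a `hyperopt` section; all names and ranges are illustrative."""
    return {
        "input_features": [{"name": "review_text", "type": "text"}],
        "output_features": [{"name": "sentiment", "type": "category"}],
        # Assumption: key names in this section vary across Ludwig versions.
        "hyperopt": {
            "goal": "maximize",
            "metric": "accuracy",
            "output_feature": "sentiment",
            "parameters": {
                "trainer.learning_rate": {
                    "space": "loguniform",
                    "lower": 0.0001,
                    "upper": 0.1,
                },
            },
            "executor": {"type": "ray", "num_samples": 4},
        },
    }


def run_hyperopt(dataset_path: str) -> List[dict]:
    """Launch a hyperparameter search and return the per-trial results."""
    # Imported lazily so this module can be loaded without Ludwig installed.
    from ludwig.hyperopt.run import hyperopt

    return hyperopt(
        build_hyperopt_config(),
        dataset=dataset_path,
        experiment_name="hyperopt",
        output_directory="results",
        resume=False,              # always start new trials
        random_seed=42,
        hyperopt_log_verbosity=1,  # only status updates
    )


def best_trial(results: List[dict]) -> dict:
    """Return the top trial, relying on the documented best-first ordering."""
    if not results:
        raise ValueError("no trial results to choose from")
    return results[0]
```

Because the documented return value is ordered by descending performance on the target metric, best_trial simply takes the first element instead of re-sorting the list.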