LudwigModel
LudwigModel class¶
ludwig.api.LudwigModel(
config,
logging_level=40,
backend=None,
gpus=None,
gpu_memory_limit=None,
allow_parallel_threads=True,
callbacks=None
)
Class that allows access to high level Ludwig functionalities.
Inputs
- config (Union[str, dict]): in-memory representation of config or string path to a YAML config file.
- logging_level (int, default: 40): log level that will be sent to stderr.
- backend (Union[Backend, str]): Backend or string name of backend to use to execute preprocessing / training steps.
- gpus (Union[str, int, List[int]], default: None): GPUs to use (uses the same syntax as CUDA_VISIBLE_DEVICES).
- gpu_memory_limit (float, default: None): maximum memory fraction [0, 1] allowed to allocate per GPU device.
- allow_parallel_threads (bool, default: True): allow Torch to use multithreading parallelism to improve performance at the cost of determinism.
- callbacks (list, default: None): a list of ludwig.callbacks.Callback objects that provide hooks into the Ludwig pipeline.
Example usage:
from ludwig.api import LudwigModel
Train a model:
config = {...}
ludwig_model = LudwigModel(config)
train_stats, _, _ = ludwig_model.train(dataset=file_path)
or
train_stats, _, _ = ludwig_model.train(dataset=dataframe)
If you have already trained a model, you can load it and use it to predict:
ludwig_model = LudwigModel.load(model_dir)
Predict:
predictions, _ = ludwig_model.predict(dataset=file_path)
or
predictions, _ = ludwig_model.predict(dataset=dataframe)
Evaluation:
eval_stats, _, _ = ludwig_model.evaluate(dataset=file_path)
or
eval_stats, _, _ = ludwig_model.evaluate(dataset=dataframe)
PublicAPI: This API is stable across Ludwig releases.
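As a minimal sketch of the constructor arguments described above, the snippet below assembles a config dictionary and a set of keyword arguments for LudwigModel. The feature names (review_text, sentiment) and the helper build_model are hypothetical, and the input_features / output_features layout is assumed from the Ludwig User Guide rather than stated on this page. Ludwig is imported lazily so that the dictionaries can be inspected without the package installed.

```python
import logging

# Hypothetical config: one text input feature and one category output feature.
config = {
    "input_features": [{"name": "review_text", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
}

# Keyword arguments mirroring the constructor signature above.
model_kwargs = {
    "logging_level": logging.ERROR,   # numerically 40, the documented default
    "gpus": "0,1",                    # same syntax as CUDA_VISIBLE_DEVICES
    "gpu_memory_limit": 0.5,          # fraction of each GPU's memory, in [0, 1]
    "allow_parallel_threads": False,  # favor determinism over speed
}


def build_model(config, **kwargs):
    """Create a LudwigModel; requires the ludwig package to be installed."""
    from ludwig.api import LudwigModel  # imported lazily on purpose

    return LudwigModel(config, **kwargs)
```

Calling build_model(config, **model_kwargs) would then return an untrained model ready for train.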
LudwigModel methods¶
collect_activations¶
collect_activations(
layer_names,
dataset,
data_format=None,
split='full',
batch_size=128
)
Loads a pre-trained model and input data to collect the values of the activations contained in the tensors.
Inputs
- layer_names (list): list of strings for layer names in the model to collect activations.
- dataset (Union[str, Dict[str, list], pandas.DataFrame]): source containing the data to make predictions.
- data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
- split (str, default: 'full'): if the input dataset contains a split column, this parameter indicates which split of the data to use. Possible values are 'full', 'training', 'validation', 'test'.
- batch_size (int, default: 128): size of batch to use when making predictions.
Return
- return (list): list of collected tensors.
collect_weights¶
collect_weights(
tensor_names=None
)
Load a pre-trained model and collect the tensors with a specific name.
Inputs
- tensor_names (list, default: None): List of tensor names to collect weights.
Return
- return (list): List of tensors
create_model¶
create_model(
config_obj,
random_seed=42
)
Instantiates BaseModel object.
Inputs
- config_obj (Union[Config, dict]): Ludwig config object
- random_seed (int, default: ludwig default random seed): Random seed used for weights initialization, splits and any other random function.
Return
- return (ludwig.models.BaseModel): Instance of the Ludwig model object.
evaluate¶
evaluate(
dataset=None,
data_format=None,
split='full',
batch_size=None,
skip_save_unprocessed_output=True,
skip_save_predictions=True,
skip_save_eval_stats=True,
collect_predictions=False,
collect_overall_stats=False,
output_directory='results',
return_type=<class 'pandas.core.frame.DataFrame'>
)
This function is used to predict the output variables given the input variables using the trained model and compute test statistics like performance measures, confusion matrices and the like.
Inputs
- dataset (Union[str, dict, pandas.DataFrame]): source containing the entire dataset to be evaluated.
- data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
- split (str, default: 'full'): if the input dataset contains a split column, this parameter indicates which split of the data to use. Possible values are 'full', 'training', 'validation', 'test'.
- batch_size (int, default: None): size of batch to use when making predictions. Defaults to the model config eval_batch_size.
- skip_save_unprocessed_output (bool, default: True): if this parameter is False, predictions and their probabilities are saved in both raw unprocessed numpy files containing tensors and as postprocessed CSV files (one for each output feature). If this parameter is True, only the CSV ones are saved and the numpy ones are skipped.
- skip_save_predictions (bool, default: True): skips saving test predictions CSV files.
- skip_save_eval_stats (bool, default: True): skips saving test statistics JSON file.
- collect_predictions (bool, default: False): if True, collects post-processed predictions during eval.
- collect_overall_stats (bool, default: False): if True, collects overall stats during eval.
- output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
- return_type (Union[str, dict, pd.DataFrame], default: pandas.DataFrame): indicates the format of the returned predictions.
Return
- return (evaluation_statistics, predictions, output_directory): evaluation_statistics is a dictionary containing evaluation performance statistics, predictions contains the predicted values, and output_directory is the location where results are stored.
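The three-element return value of evaluate can be unpacked as in the sketch below. The helper name evaluate_in_memory is hypothetical, and model is assumed to be an already trained LudwigModel (or any object exposing the same evaluate signature). The flags passed are the ones documented above: the skip_save_* flags are left at True and the collect_* flags are switched on so that predictions and overall statistics are requested.

```python
def evaluate_in_memory(model, dataset, split="full"):
    """Evaluate `dataset` and return (evaluation_statistics, predictions)."""
    eval_stats, predictions, output_directory = model.evaluate(
        dataset=dataset,
        split=split,
        skip_save_unprocessed_output=True,  # do not write raw numpy outputs
        skip_save_predictions=True,         # do not write prediction CSV files
        skip_save_eval_stats=True,          # do not write the statistics JSON file
        collect_predictions=True,           # request post-processed predictions
        collect_overall_stats=True,         # request overall statistics
    )
    # output_directory is returned by evaluate but not needed by this helper.
    return eval_stats, predictions
```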
experiment¶
experiment(
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
data_format=None,
experiment_name='experiment',
model_name='run',
model_resume_path=None,
eval_split='test',
skip_save_training_description=False,
skip_save_training_statistics=False,
skip_save_model=False,
skip_save_progress=False,
skip_save_log=False,
skip_save_processed_input=False,
skip_save_unprocessed_output=False,
skip_save_predictions=False,
skip_save_eval_stats=False,
skip_collect_predictions=False,
skip_collect_overall_stats=False,
output_directory='results',
random_seed=42
)
Trains a model on a dataset's training and validation splits and uses it to predict on the test split. It saves the trained model and the statistics of training and testing.
Inputs
- dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
- training_set (Union[str, dict, pandas.DataFrame], default: None): source containing training data.
- validation_set (Union[str, dict, pandas.DataFrame], default: None): source containing validation data.
- test_set (Union[str, dict, pandas.DataFrame], default: None): source containing test data.
- training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
- data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
- experiment_name (str, default: 'experiment'): name for the experiment.
- model_name (str, default: 'run'): name of the model that is being used.
- model_resume_path (str, default: None): resumes training of the model from the path specified. The config is restored. In addition to config, training statistics, loss for each epoch and the state of the optimizer are restored such that training can be effectively continued from a previously interrupted training process.
- eval_split (str, default: 'test'): split on which to perform evaluation. Valid values are 'training', 'validation' and 'test'.
- skip_save_training_description (bool, default: False): disables saving the description JSON file.
- skip_save_training_statistics (bool, default: False): disables saving training statistics JSON file.
- skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch the validation metric improves, but if the model is really big that can be time consuming. If you do not want to keep the weights and just want to find out what performance a model can get with a set of hyperparameters, use this parameter to skip it, but the model will not be loadable later on and the returned model will have the weights obtained at the end of training, instead of the weights of the epoch with the best validation performance.
- skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch to enable resuming of training, but if the model is really big that can be time consuming and will use twice as much space. Use this parameter to skip it, but training cannot be resumed later on.
- skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for TensorBoard, but if they are not needed, turning this off can slightly increase the overall speed.
- skip_save_processed_input (bool, default: False): if an input dataset is provided it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
- skip_save_unprocessed_output (bool, default: False): by default predictions and their probabilities are saved in both raw unprocessed numpy files containing tensors and as postprocessed CSV files (one for each output feature). If this parameter is True, only the CSV ones are saved and the numpy ones are skipped.
- skip_save_predictions (bool, default: False): skips saving test predictions CSV files.
- skip_save_eval_stats (bool, default: False): skips saving test statistics JSON file.
- skip_collect_predictions (bool, default: False): skips collecting post-processed predictions during eval.
- skip_collect_overall_stats (bool, default: False): skips collecting overall stats during eval.
- output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
- random_seed (int, default: 42): random seed used for weights initialization, splits and any other random function.
Return
- return (Tuple[dict, dict, tuple, str]): (evaluation_statistics, training_statistics, preprocessed_data, output_directory). evaluation_statistics is a dictionary with evaluation performance statistics on the test_set. training_statistics is a nested dictionary of dataset -> feature_name -> metric_name -> list of metrics, where each entry corresponds to a training checkpoint. preprocessed_data is a tuple containing the preprocessed (training_set, validation_set, test_set). output_directory is the filepath string to where results are stored.
forecast¶
forecast(
dataset,
data_format=None,
horizon=1,
output_directory=None,
output_format='parquet'
)
free_gpu_memory¶
free_gpu_memory(
)
Manually moves the model to CPU to force GPU memory to be freed.
For more context: https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530/35
is_merge_and_unload_set¶
is_merge_and_unload_set(
)
Check whether the encapsulated model is of type LLM and is configured to merge_and_unload QLoRA weights.
Return
- return (bool): whether merge_and_unload should be done.
load¶
load(
model_dir,
logging_level=40,
backend=None,
gpus=None,
gpu_memory_limit=None,
allow_parallel_threads=True,
callbacks=None
)
This function allows for loading pretrained models.
Inputs
- model_dir (str): path to the directory containing the model. If the model was trained by the train or experiment command, the model is in results_dir/experiment_dir/model.
- logging_level (int, default: 40): log level that will be sent to stderr.
- backend (Union[Backend, str]): Backend or string name of backend to use to execute preprocessing / training steps.
- gpus (Union[str, int, List[int]], default: None): GPUs to use (uses the same syntax as CUDA_VISIBLE_DEVICES).
- gpu_memory_limit (float, default: None): maximum memory fraction [0, 1] allowed to allocate per GPU device.
- allow_parallel_threads (bool, default: True): allow Torch to use multithreading parallelism to improve performance at the cost of determinism.
- callbacks (list, default: None): a list of ludwig.callbacks.Callback objects that provide hooks into the Ludwig pipeline.
Return
- return (LudwigModel): a LudwigModel object
Example usage
ludwig_model = LudwigModel.load(model_dir)
load_weights¶
load_weights(
model_dir
)
Loads weights from a pre-trained model.
Inputs
- model_dir (str): filepath string to location of a pre-trained model
Return
- return (None): None
Example usage
ludwig_model.load_weights(model_dir)
predict¶
predict(
dataset=None,
data_format=None,
split='full',
batch_size=128,
generation_config=None,
skip_save_unprocessed_output=True,
skip_save_predictions=True,
output_directory='results',
return_type=<class 'pandas.core.frame.DataFrame'>,
callbacks=None
)
Using a trained model, make predictions from the provided dataset.
Inputs
- dataset (Union[str, dict, pandas.DataFrame]): source containing the entire dataset to be evaluated.
- data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
- split (str, default: 'full'): if the input dataset contains a split column, this parameter indicates which split of the data to use. Possible values are 'full', 'training', 'validation', 'test'.
- batch_size (int, default: 128): size of batch to use when making predictions.
- generation_config (Dict, default: None): config for the generation of the predictions. If None, the config that was used during model training is used. This is only used if the model type is LLM. Otherwise, this parameter is ignored. See Large Language Models under "Generation" for an example generation config.
- skip_save_unprocessed_output (bool, default: True): if this parameter is False, predictions and their probabilities are saved in both raw unprocessed numpy files containing tensors and as postprocessed CSV files (one for each output feature). If this parameter is True, only the CSV ones are saved and the numpy ones are skipped.
- skip_save_predictions (bool, default: True): skips saving test predictions CSV files.
- output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
- return_type (Union[str, dict, pandas.DataFrame], default: pandas.DataFrame): indicates the format of the returned predictions.
- callbacks (Optional[List[Callback]], default: None): optional list of callbacks to use during this predict operation. Any callbacks already registered to the model will be preserved.
Return
- return (Tuple[Union[dict, pd.DataFrame], str]): (predictions, output_directory). predictions contains the predictions from the provided dataset, and output_directory is the filepath string to where data was stored.
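A hedged sketch of passing generation_config to predict follows. The helper name predict_with_generation is hypothetical, and the keys max_new_tokens and temperature are assumptions borrowed from common LLM generation settings; the "Generation" section of the Large Language Models guide is the place to confirm which keys are accepted. As stated above, the parameter is ignored unless the model type is LLM.

```python
def predict_with_generation(model, dataset, max_new_tokens=32, temperature=0.1):
    """Run predict with an explicit generation config and return the predictions."""
    # Assumed keys; verify them against the LLM generation documentation.
    generation_config = {
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }
    predictions, output_directory = model.predict(
        dataset=dataset,
        generation_config=generation_config,
        skip_save_predictions=True,  # keep results in memory only
    )
    return predictions
```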
preprocess¶
preprocess(
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
data_format=None,
skip_save_processed_input=True,
random_seed=42
)
This function is used to preprocess data.
Args:
- dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
- training_set (Union[str, dict, pandas.DataFrame], default: None): source containing training data.
- validation_set (Union[str, dict, pandas.DataFrame], default: None): source containing validation data.
- test_set (Union[str, dict, pandas.DataFrame], default: None): source containing test data.
- training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
- data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
- skip_save_processed_input (bool, default: True): if an input dataset is provided it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
- random_seed (int, default: 42): a random seed that will be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling.
Returns:
- return (PreprocessedDataset): data structure containing (proc_training_set, proc_validation_set, proc_test_set, training_set_metadata).
Raises:
- RuntimeError: An error occurred while preprocessing the data. Examples include training dataset being empty after preprocessing, lazy loading not being supported with RayBackend, etc.
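The split convention mentioned for dataset (0 for train, 1 for validation, 2 for test) can be illustrated with plain Python. This is only a sketch of how such a column could be prepared before handing data to preprocess: the helper add_split_column, the column name split and the 70/10/20 fractions are assumptions, not Ludwig's own splitting logic.

```python
import random

# Split codes as documented: 0 for train, 1 for validation, 2 for test.
SPLIT_CODES = {"train": 0, "validation": 1, "test": 2}


def add_split_column(rows, fractions=(0.7, 0.1, 0.2), seed=42, column="split"):
    """Return copies of `rows` (a list of dicts) with a split code in `column`."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)  # random assignment, reproducible through `seed`

    n_train = int(len(rows) * fractions[0])
    n_val = int(len(rows) * fractions[1])

    out = [dict(row) for row in rows]  # leave the input rows untouched
    for rank, idx in enumerate(order):
        if rank < n_train:
            code = SPLIT_CODES["train"]
        elif rank < n_train + n_val:
            code = SPLIT_CODES["validation"]
        else:
            code = SPLIT_CODES["test"]  # remainder goes to the test split
        out[idx][column] = code
    return out
```

The resulting rows could then be turned into a DataFrame or a dict of columns and passed as dataset, so that the existing split column is used instead of a random split.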
save¶
save(
save_path
)
This function saves the model to disk.
Inputs
- save_path (str): path to the directory where the model is going to be saved. Both a JSON file containing the model architecture hyperparameters and checkpoint files containing model weights will be saved.
Return
- return (None):
None
Example usage
ludwig_model.save(save_path)
save_config¶
save_config(
save_path
)
Save config to specified location.
Inputs
- save_path (str): filepath string to save config as a JSON file.
Return
- return (None): None
save_torchscript¶
save_torchscript(
save_path,
model_only=False,
device=None
)
Saves the Torchscript model to disk.
Inputs
- save_path (str): the path to the directory where the model will be saved.
- model_only (bool, optional): if True, only the ECD model will be converted to Torchscript. Otherwise, the preprocessing and postprocessing steps will also be converted to Torchscript.
- device (TorchDevice, optional): if None, the model will be converted to Torchscript on the same device to ensure maximum model parity.
Return
- return (None): None
set_logging_level¶
set_logging_level(
logging_level
)
Sets level for log messages.
Inputs
- logging_level (int): Set/Update the logging level. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR.
Return
- return (None): None
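The logging constants mentioned above are plain integers from Python's standard logging module, which is why the defaults elsewhere on this page appear as 40 (ERROR) and 20 (INFO). The sketch below lists them and shows a small, hypothetical helper set_verbose that toggles a model between two levels; model is assumed to expose set_logging_level as documented.

```python
import logging

# Numeric values of the standard logging constants.
levels = {
    "DEBUG": logging.DEBUG,      # 10
    "INFO": logging.INFO,        # 20
    "WARNING": logging.WARNING,  # 30
    "ERROR": logging.ERROR,      # 40
}


def set_verbose(model, verbose):
    """Switch a model between INFO (verbose) and ERROR (quiet) logging."""
    level = logging.INFO if verbose else logging.ERROR
    model.set_logging_level(level)
    return level
```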
to_torchscript¶
to_torchscript(
model_only=False,
device=None
)
Converts the trained model to Torchscript.
Inputs
- model_only (bool, optional): if True, only the ECD model will be converted to Torchscript. Otherwise, preprocessing and postprocessing steps will also be converted to Torchscript.
- device (TorchDevice, optional): if None, the model will be converted to Torchscript on the same device to ensure maximum model parity.
Returns
- return (torch.jit.ScriptModule): a torch.jit.ScriptModule that can be used to predict on a dictionary of inputs.
train¶
train(
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
data_format=None,
experiment_name='api_experiment',
model_name='run',
model_resume_path=None,
skip_save_training_description=False,
skip_save_training_statistics=False,
skip_save_model=False,
skip_save_progress=False,
skip_save_log=False,
skip_save_processed_input=False,
output_directory='results',
random_seed=42
)
This function is used to perform a full training of the model on the specified dataset.
During training, if the skip parameters are False, the model and statistics will be saved in a directory [output_dir]/[experiment_name]_[model_name]_n, where all variables are resolved to user-specified ones and n is an increasing number starting from 0 used to differentiate among repeated runs.
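The naming rule above can be sketched in a few lines of standard-library Python. This only illustrates the documented pattern [output_dir]/[experiment_name]_[model_name]_n; the helper next_run_directory is hypothetical and Ludwig's own directory selection may differ in detail.

```python
import os


def next_run_directory(output_directory, experiment_name, model_name):
    """Return the first '[experiment_name]_[model_name]_n' path that does not exist yet."""
    n = 0
    while True:
        candidate = os.path.join(
            output_directory, f"{experiment_name}_{model_name}_{n}"
        )
        if not os.path.exists(candidate):
            return candidate
        n += 1  # a previous run already used this suffix, try the next one
```

With the default arguments of train, a first run would map to results/api_experiment_run_0 under this rule and a second run to results/api_experiment_run_1.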
Inputs
- dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
- training_set (Union[str, dict, pandas.DataFrame], default: None): source containing training data.
- validation_set (Union[str, dict, pandas.DataFrame], default: None): source containing validation data.
- test_set (Union[str, dict, pandas.DataFrame], default: None): source containing test data.
- training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
- data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
- experiment_name (str, default: 'api_experiment'): name for the experiment.
- model_name (str, default: 'run'): name of the model that is being used.
- model_resume_path (str, default: None): resumes training of the model from the path specified. The config is restored. In addition to config, training statistics, loss for each epoch and the state of the optimizer are restored such that training can be effectively continued from a previously interrupted training process.
- skip_save_training_description (bool, default: False): disables saving the description JSON file.
- skip_save_training_statistics (bool, default: False): disables saving training statistics JSON file.
- skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch the validation metric improves, but if the model is really big that can be time consuming. If you do not want to keep the weights and just want to find out what performance a model can get with a set of hyperparameters, use this parameter to skip it, but the model will not be loadable later on and the returned model will have the weights obtained at the end of training, instead of the weights of the epoch with the best validation performance.
- skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch to enable resuming of training, but if the model is really big that can be time consuming and will use twice as much space. Use this parameter to skip it, but training cannot be resumed later on.
- skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for TensorBoard, but if they are not needed, turning this off can slightly increase the overall speed.
- skip_save_processed_input (bool, default: False): if an input dataset is provided it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
- output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
- random_seed (int, default: 42): a random seed that will be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling.
- kwargs (dict, default: {}): a dictionary of optional parameters.
Return
- return (Tuple[Dict, Union[Dict, pd.DataFrame], str]): tuple containing (training_statistics, preprocessed_data, output_directory). training_statistics is a nested dictionary of dataset -> feature_name -> metric_name -> list of metrics, where each entry corresponds to a training checkpoint. preprocessed_data is the tuple containing these three data sets (training_set, validation_set, test_set). output_directory is the filepath to where training results are stored.
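The nested layout of training_statistics (dataset -> feature_name -> metric_name -> list of metrics) can be navigated with ordinary dictionary lookups. The sketch below uses made-up numbers and hypothetical key names (validation, sentiment, loss); the actual keys depend on the config and on the features being trained.

```python
def metric_history(training_statistics, dataset, feature_name, metric_name):
    """Return the list of values for one metric, one entry per checkpoint."""
    return training_statistics[dataset][feature_name][metric_name]


def best_checkpoint(history, higher_is_better=False):
    """Return (checkpoint_index, value) of the best entry in a metric history."""
    if not history:
        raise ValueError("metric history is empty")
    pick = max if higher_is_better else min
    index = pick(range(len(history)), key=lambda i: history[i])
    return index, history[index]


# Hypothetical statistics with three checkpoints.
example_stats = {"validation": {"sentiment": {"loss": [0.9, 0.5, 0.6]}}}
history = metric_history(example_stats, "validation", "sentiment", "loss")
best = best_checkpoint(history)  # (1, 0.5): the second checkpoint has the lowest loss
```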
train_online¶
train_online(
dataset,
training_set_metadata=None,
data_format='auto',
random_seed=42
)
Performs one epoch of training of the model on dataset.
Inputs
- dataset (Union[str, dict, pandas.DataFrame]): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
- training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
- data_format (str, default: 'auto'): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
- random_seed (int, default: 42): a random seed that is going to be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling.
Return
- return (None):
None
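Because each call to train_online performs a single epoch, training for several epochs amounts to calling it in a loop. The helper below is a hypothetical sketch; model is assumed to be a LudwigModel (or any object with the same train_online signature).

```python
def train_for_epochs(model, dataset, epochs):
    """Call train_online once per epoch and return the number of epochs run."""
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    for _ in range(epochs):
        # One call corresponds to one epoch over `dataset`.
        model.train_online(dataset=dataset)
    return epochs
```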
upload_to_hf_hub¶
upload_to_hf_hub(
repo_id,
model_path,
repo_type='model',
private=False,
commit_message='Upload trained [Ludwig](https://ludwig.ai/latest/) model weights',
commit_description=None
)
Uploads trained model artifacts to the HuggingFace Hub.
Inputs
- repo_id (str): a namespace (user or an organization) and a repo name separated by a /.
- model_path (str): the path of the saved model. This is the top level directory where the model's weights as well as other associated training artifacts are saved.
- repo_type (str, optional): set to "dataset" or "space" if uploading to a dataset or space, None or "model" if uploading to a model. Defaults to "model".
- private (bool, optional, defaults to False): whether the model repo should be private.
- commit_message (str, optional): the summary / title / first line of the generated commit. Defaults to 'Upload trained [Ludwig](https://ludwig.ai/latest/) model weights'.
- commit_description (str, optional): the description of the generated commit.
Returns
- return (bool): True for success, False for failure.
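Since repo_id must combine a namespace and a repo name separated by a /, it can be worth checking the string before attempting an upload. The helper split_repo_id below is a hypothetical pre-check written in plain Python; it is not part of Ludwig or of the HuggingFace Hub client, and it does not validate which characters the Hub allows.

```python
def split_repo_id(repo_id):
    """Split 'namespace/repo_name' into its two parts, or raise ValueError."""
    parts = repo_id.split("/")
    # Exactly one slash and two non-empty parts are expected.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'namespace/repo_name', got {repo_id!r}")
    return parts[0], parts[1]
```

For example, split_repo_id("my-org/my-model") yields the namespace "my-org" and the repo name "my-model", while a string without a slash raises ValueError before any network call is made.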
Module functions¶
kfold_cross_validate¶
ludwig.api.kfold_cross_validate(
num_folds,
config,
dataset=None,
data_format=None,
skip_save_training_description=False,
skip_save_training_statistics=False,
skip_save_model=False,
skip_save_progress=False,
skip_save_log=False,
skip_save_processed_input=False,
skip_save_predictions=False,
skip_save_eval_stats=False,
skip_collect_predictions=False,
skip_collect_overall_stats=False,
output_directory='results',
random_seed=42,
gpus=None,
gpu_memory_limit=None,
allow_parallel_threads=True,
backend=None,
logging_level=20
)
Performs k-fold cross validation and returns result data structures.
Inputs
- num_folds (int): number of folds to create for the cross-validation
- config (Union[dict, str]): model specification required to build a model. Parameter may be a dictionary or string specifying the file path to a yaml configuration file. Refer to the User Guide for details.
- dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used for k_fold processing.
- data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'. Currently hdf5 format is not supported for k_fold cross validation.
- skip_save_training_description (bool, default: False): disables saving the description JSON file.
- skip_save_training_statistics (bool, default: False): disables saving training statistics JSON file.
- skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch the validation metric improves, but if the model is really big that can be time consuming. If you do not want to keep the weights and just want to find out what performance a model can get with a set of hyperparameters, use this parameter to skip it, but the model will not be loadable later on and the returned model will have the weights obtained at the end of training, instead of the weights of the epoch with the best validation performance.
- skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch to enable resuming of training, but if the model is really big that can be time consuming and will use twice as much space. Use this parameter to skip it, but training cannot be resumed later on.
- skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for TensorBoard, but if they are not needed, turning this off can slightly increase the overall speed.
- skip_save_processed_input (bool, default: False): if an input dataset is provided it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
- skip_save_predictions (bool, default: False): skips saving test predictions CSV files.
- skip_save_eval_stats (bool, default: False): skips saving test statistics JSON file.
- skip_collect_predictions (bool, default: False): skips collecting post-processed predictions during eval.
- skip_collect_overall_stats (bool, default: False): skips collecting overall stats during eval.
- output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
- random_seed (int, default: 42): random seed used for weights initialization, splits and any other random function.
- gpus (list, default: None): list of GPUs that are available for training.
- gpu_memory_limit (float, default: None): maximum memory fraction [0, 1] allowed to allocate per GPU device.
- allow_parallel_threads (bool, default: True): allow Torch to use multithreading parallelism to improve performance at the cost of determinism.
- backend (Union[Backend, str]): Backend or string name of backend to use to execute preprocessing / training steps.
- logging_level (int, default: INFO): log level to send to stderr.
Return
- return (tuple(kfold_cv_statistics, kfold_split_indices), dict): a tuple of dictionaries.
kfold_cv_statistics
: contains metrics from the cross-validation run.
kfold_split_indices
: indices to split training data into training fold and test fold.
PublicAPI: This API is stable across Ludwig releases.
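To make the shape of the split indices more concrete, here is a minimal pure-Python sketch of how k-fold indices can be generated. It is not Ludwig's implementation: the helper name `make_kfold_indices` and the dictionary keys (`fold_1`, `training_indices`, `test_indices`) are assumptions chosen for illustration only.

```python
import random


def make_kfold_indices(num_rows, num_folds, random_seed=42):
    """Illustrative k-fold splitter (hypothetical helper, not part of Ludwig)."""
    if not 2 <= num_folds <= num_rows:
        raise ValueError("num_folds must be between 2 and num_rows")

    # Shuffle row indices deterministically so the same seed gives the same folds.
    indices = list(range(num_rows))
    random.Random(random_seed).shuffle(indices)

    folds = {}
    for k in range(num_folds):
        # Every num_folds-th shuffled index, starting at k, forms the test fold.
        test = sorted(indices[k::num_folds])
        held_out = set(test)
        train = sorted(i for i in indices if i not in held_out)
        # Key names are assumed for illustration only.
        folds[f"fold_{k + 1}"] = {"training_indices": train, "test_indices": test}
    return folds


splits = make_kfold_indices(num_rows=10, num_folds=5)
for name, fold in splits.items():
    print(name, "test rows:", fold["test_indices"])
```

Each row lands in exactly one test fold, and within a fold the training and test indices together cover the whole dataset.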
hyperopt¶
ludwig.hyperopt.run.hyperopt(
config,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
data_format=None,
experiment_name='hyperopt',
model_name='run',
resume=None,
skip_save_training_description=False,
skip_save_training_statistics=False,
skip_save_model=False,
skip_save_progress=False,
skip_save_log=False,
skip_save_processed_input=True,
skip_save_unprocessed_output=False,
skip_save_predictions=False,
skip_save_eval_stats=False,
skip_save_hyperopt_statistics=False,
output_directory='results',
gpus=None,
gpu_memory_limit=None,
allow_parallel_threads=True,
callbacks=None,
tune_callbacks=None,
backend=None,
random_seed=42,
hyperopt_log_verbosity=3
)
This method performs hyperparameter optimization.
Inputs
- config (Union[str, dict]): config which defines the different parameters of the model, features, preprocessing and training. If
str
, filepath to a YAML configuration file. - dataset (Union[str, dict, pandas.DataFrame], default:
None
): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split. - training_set (Union[str, dict, pandas.DataFrame], default:
None
): source containing training data. - validation_set (Union[str, dict, pandas.DataFrame], default:
None
): source containing validation data. - test_set (Union[str, dict, pandas.DataFrame], default:
None
): source containing test data. - training_set_metadata (Union[str, dict], default:
None
): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset, created the first time an input file is used and stored in the same directory with the same name and a '.meta.json' extension. - data_format (str, default:
None
): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'. - experiment_name (str, default:
'hyperopt'
): name for the experiment. - model_name (str, default:
'run'
): name of the model that is being used. - resume (bool, default: None): If True, continue hyperopt from the state of the previous run in the output directory with the same experiment name. If False, create new trials, ignoring any previous state even if it exists in the output_directory. By default, hyperopt attempts to resume if an experiment with the same name already exists, and creates new trials otherwise.
- skip_save_training_description (bool, default:
False
): disables saving the description JSON file. - skip_save_training_statistics (bool, default:
False
): disables saving training statistics JSON file. - skip_save_model (bool, default:
False
): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch in which the validation metric improves, but if the model is very large this can be time consuming. If you do not want to keep the weights and only want to find out what performance a model can reach with a set of hyperparameters, use this parameter to skip saving. The model will then not be loadable later on, and the returned model will have the weights obtained at the end of training instead of the weights of the epoch with the best validation performance. - skip_save_progress (bool, default:
False
): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch so that training can be resumed, but if the model is very large this can be time consuming and uses twice as much space. Use this parameter to skip it; training then cannot be resumed later on. - skip_save_log (bool, default:
False
): disables saving TensorBoard logs. By default Ludwig saves logs for TensorBoard, but if they are not needed, turning this off can slightly increase the overall speed. - skip_save_processed_input (bool, default:
True
): if an input dataset is provided, it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved. - skip_save_unprocessed_output (bool, default:
False
): by default predictions and their probabilities are saved in both raw unprocessed numpy files containing tensors and as postprocessed CSV files (one for each output feature). If this parameter is True, only the CSV ones are saved and the numpy ones are skipped. - skip_save_predictions (bool, default:
False
): skips saving test predictions CSV files. - skip_save_eval_stats (bool, default:
False
): skips saving test statistics JSON file. - skip_save_hyperopt_statistics (bool, default:
False
): skips saving hyperopt stats file. - output_directory (str, default:
'results'
): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files. - gpus (list, default:
None
): list of GPUs that are available for training. - gpu_memory_limit (float, default:
None
): maximum memory fraction [0, 1] allowed to allocate per GPU device. - allow_parallel_threads (bool, default:
True
): allow PyTorch to use multithreading parallelism to improve performance at the cost of determinism. - callbacks (list, default:
None
): a list of ludwig.callbacks.Callback
objects that provide hooks into the Ludwig pipeline. - backend (Union[Backend, str]):
Backend
or string name of backend to use to execute preprocessing / training steps. - random_seed (int, default: 42): random seed used for weights initialization, splits and any other random function.
- hyperopt_log_verbosity (int, default: 3): controls verbosity of Ray Tune log messages. Valid values: 0 = silent, 1 = only status updates, 2 = status and brief trial results, 3 = status and detailed trial results.
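The three-state behaviour of `resume` described in the list above can be summarised in a few lines. The sketch below only mirrors the documented semantics; `should_resume` and its `experiment_exists` argument are hypothetical names, not part of Ludwig's API.

```python
from typing import Optional


def should_resume(resume: Optional[bool], experiment_exists: bool) -> bool:
    """Hypothetical helper mirroring the documented `resume` semantics."""
    if resume is None:
        # Default: resume only when an experiment with the same name already exists.
        return experiment_exists
    # An explicit True or False is taken as given. What happens when resume=True
    # and no previous state exists is not covered by the description, so this
    # sketch simply returns the flag.
    return resume


print(should_resume(None, experiment_exists=True))   # resumes the existing experiment
print(should_resume(False, experiment_exists=True))  # starts new trials
```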
Return
- return (List[dict]): List of results for each trial, ordered by descending performance on the target metric.
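As a rough usage sketch, the call below passes only arguments listed in the signature above and then picks the best trial. It assumes the return value is the list described under Return, ordered by descending performance. The helper names `best_trial` and `run_hyperopt`, and the file paths in the usage note, are hypothetical.

```python
def best_trial(results):
    """Return the top-ranked trial, or None if no trials were produced."""
    # Results are documented as ordered by descending performance,
    # so the first entry is taken to be the best one.
    return results[0] if results else None


def run_hyperopt(config_path, dataset_path, output_directory="results"):
    """Run hyperparameter optimization and return the best trial."""
    # Imported lazily so this module can be loaded without Ludwig installed.
    from ludwig.hyperopt.run import hyperopt

    results = hyperopt(
        config=config_path,
        dataset=dataset_path,
        experiment_name="hyperopt",
        output_directory=output_directory,
        random_seed=42,
    )
    return best_trial(results)
```

A call such as `run_hyperopt("config.yaml", "data.csv")` would then return the highest-ranked trial dictionary, provided the config file defines a hyperopt section.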