Hyperopt
In order to perform hyper-parameter optimization, the hyper-parameter optimization configuration has to be provided inside the Ludwig configuration under the root key hyperopt.
Its configuration contains what metric to optimize, which parameters to optimize, which sampler to use, and how to execute the optimization.
The parameters that can be defined in the hyperopt configuration are:
- goal indicates whether to minimize or maximize a metric or a loss of any of the output features on any of the dataset splits. Available values are: minimize (default) or maximize.
- output_feature is a str containing the name of the output feature whose metric or loss we want to optimize. Available values are combined (default) or the name of any output feature provided in the configuration. combined is a special output feature that allows optimizing for the aggregated loss and metrics of all output features.
- metric is the metric that we want to optimize for. The default one is loss, but depending on the type of the feature defined in output_feature, different metrics and losses are available. Check the metrics section of the specific output feature type to figure out what metrics are available to use.
- split is the split of data that we want to compute our metric on. By default it is the validation split, but you also have the flexibility to specify the train or test split.
- parameters consists of a set of hyper-parameters to optimize. They are provided as keys (the names of the parameters) and values associated with them (which define the search space). The values vary depending on the type of the hyper-parameter; types can be float, int and category.
- sampler contains the sampler type to be used for sampling hyper-parameter values and its configuration. Currently available sampler types are grid, random, pysot and ray. The sampler configuration parameters modify the sampler behavior, for instance for random you can set how many random samples to draw.
- executor specifies how to execute the hyper-parameter optimization. The execution could happen locally in a serial manner or in parallel across multiple workers, with GPUs as well if available.
Example:
hyperopt:
  goal: minimize
  output_feature: combined
  metric: loss
  split: validation
  parameters:
    utterance.cell_type: ...
    utterance.num_layers: ...
    combiner.num_fc_layers: ...
    section.embedding_size: ...
    preprocessing.text.vocab_size: ...
    training.learning_rate: ...
    training.optimizer.type: ...
    ...
  sampler:
    type: grid # random, ...
    # sampler parameters...
  executor:
    type: serial # parallel, ...
    # executor parameters...
In the parameters section, . is used to reference a parameter nested inside a section of the configuration.
For instance, to reference the learning_rate, one would have to use the name training.learning_rate.
If the parameter to reference is inside an input or output feature, the name of that feature is used as the starting point.
For instance, for referencing the cell_type of the utterance feature, use the name utterance.cell_type.
Hyper-parameters
Float parameters
For a float value, the parameters to specify are:
- low: the minimum value the parameter can have
- high: the maximum value the parameter can have
- scale: linear (default) or log
- steps: OPTIONAL number of steps

For instance, low: 0.0, high: 1.0 and steps: 3 would yield [0.0, 0.5, 1.0] as potential values to sample from, while if steps is not specified, the full range between 0.0 and 1.0 will be used.
Example:
training.learning_rate:
  type: float
  low: 0.001
  high: 0.1
  steps: 4
  scale: linear
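As a rough illustration of how steps and scale interact, the sketch below (plain numpy, not Ludwig internals; it assumes the expansion is simply evenly spaced points on the chosen scale, as described above) expands the example range on both a linear and a log scale.

import numpy as np

# Illustrative sketch: expanding a float range with a fixed number of steps.
# Assumes evenly spaced points on the chosen scale; values match the example above.
low, high, steps = 0.001, 0.1, 4

linear_values = np.linspace(low, high, steps)                   # linear scale
log_values = np.logspace(np.log10(low), np.log10(high), steps)  # log scale

print(linear_values)  # approx. [0.001, 0.034, 0.067, 0.1]
print(log_values)     # approx. [0.001, 0.00464, 0.02154, 0.1]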
Int parameters
For an int value, the parameters to specify are:
- low: the minimum value the parameter can have
- high: the maximum value the parameter can have
- steps: OPTIONAL number of steps

For instance, low: 0, high: 10 and steps: 3 would yield [0, 5, 10] for the search, while if steps is not specified, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] will be used.
Example:
combiner.num_fc_layers:
  type: int
  low: 1
  high: 4
Category parameters
For a category value, the parameters to specify are:
- values: a list of possible values. The type of each value of the list is not important (they could be strings, integers, floats and anything else, even entire dictionaries).
Example:
utterance.cell_type:
  type: category
  values: [rnn, gru, lstm]
Sampler
Grid sampler
The grid sampler creates a search space by exhaustively selecting all elements from the outer product of all possible values of the hyper-parameters provided in the parameters section.
For float parameters, it is required to specify the number of steps.
Example:
sampler:
  type: grid
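To make the outer product explicit, here is a small illustrative sketch (not Ludwig's actual implementation) that enumerates a grid over two hypothetical parameter value lists with itertools.product.

from itertools import product

# Illustrative sketch of a grid search space as an outer product.
# The parameter names and value lists are hypothetical examples.
search_space = {
    "utterance.cell_type": ["rnn", "gru", "lstm"],
    "combiner.num_fc_layers": [1, 2, 3, 4],
}

names = list(search_space)
samples = [dict(zip(names, combo)) for combo in product(*search_space.values())]
print(len(samples))  # 12 = 3 * 4 combinations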
Random sampler
The random sampler samples hyper-parameter values randomly from the parameters search space.
num_samples (default: 10) can be specified in the sampler section.
Example:
sampler:
  type: random
  num_samples: 10
PySOT sampler
The pysot sampler uses the pySOT package for asynchronous surrogate optimization.
This package implements many popular methods from Bayesian optimization and surrogate optimization.
By default, pySOT uses the Stochastic RBF (SRBF) method by Regis and Shoemaker.
SRBF starts by evaluating a symmetric Latin hypercube design of size 2 * d + 1, where d is the number of hyperparameters that are optimized.
When these points have been evaluated, SRBF fits a radial basis function surrogate and uses this surrogate together with an acquisition function to select the next sample(s).
We recommend using at least 10 * d total samples to allow the algorithm to converge.
More details are available on the GitHub page: https://github.com/dme65/pySOT.
Example:
sampler:
  type: pysot
  num_samples: 10
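As a quick worked example of the sizing guidance above (the number of hyper-parameters here is just an illustrative assumption, not a value from the docs):

# Worked example of the pySOT sizing guidance above.
# d = 5 is an arbitrary illustrative number of hyper-parameters.
d = 5
initial_design_size = 2 * d + 1   # symmetric Latin hypercube evaluated first
recommended_budget = 10 * d       # suggested minimum total number of samples
print(initial_design_size, recommended_budget)  # 11 50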
Ray Tune sampler
The ray sampler is used in conjunction with the ray executor to enable Ray Tune for distributed hyperopt across a cluster of machines.
Ray Tune supports its own collection of search algorithms, specified by the search_alg section of the sampler config:
sampler:
  type: ray
  search_alg:
    type: ax
You can find the full list of supported search algorithm names in Ray Tune's create_searcher function.
Ray Tune also allows you to specify a scheduler to support features like early stopping and other population-based strategies that may pause and resume trials during training. Ludwig exposes the complete scheduler API in the scheduler section of the config:
sampler:
  type: ray
  search_alg:
    type: bohb
  scheduler:
    type: hb_bohb
    time_attr: training_iteration
    reduction_factor: 4
You can find the full list of supported schedulers in Ray Tune's create_scheduler function.
Other config options, including parameters, num_samples, and goal, work the same for Ray Tune as they do for other sampling strategies in Ludwig. The parameters will be converted from the Ludwig format into a Ray Tune search space. However, note that the space field of the Ludwig config should conform to Ray Tune's distribution names. For example:
hyperopt:
  parameters:
    training.learning_rate:
      space: loguniform
      lower: 0.001
      upper: 0.1
    combiner.num_fc_layers:
      space: randint
      lower: 2
      upper: 6
    utterance.cell_type:
      space: grid_search
      values: ["rnn", "gru"]
    utterance.bidirectional:
      space: choice
      categories: [True, False]
    utterance.fc_layers:
      space: choice
      categories:
        - [{"fc_size": 512}, {"fc_size": 256}]
        - [{"fc_size": 512}]
        - [{"fc_size": 256}]
  goal: minimize
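For reference, the spaces above roughly correspond to the following Ray Tune search space constructors. This is a hand-written sketch of the equivalent tune calls, not output generated by Ludwig.

from ray import tune

# Rough Ray Tune equivalents of the spaces above (illustrative sketch).
search_space = {
    "training.learning_rate": tune.loguniform(0.001, 0.1),
    "combiner.num_fc_layers": tune.randint(2, 6),
    "utterance.cell_type": tune.grid_search(["rnn", "gru"]),
    "utterance.bidirectional": tune.choice([True, False]),
    "utterance.fc_layers": tune.choice([
        [{"fc_size": 512}, {"fc_size": 256}],
        [{"fc_size": 512}],
        [{"fc_size": 256}],
    ]),
}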
Executor
Serial Executor
The serial executor performs hyper-parameter optimization locally in a serial manner, executing the elements in the set of sampled parameters obtained by the selected sampler one at a time.
Example:
executor:
  type: serial
Parallel Executor
The parallel executor performs hyper-parameter optimization in parallel, executing the elements in the set of sampled parameters obtained by the selected sampler at the same time.
The maximum number of parallel workers that train and evaluate models is defined by the parameter num_workers (default: 2).
When training with GPUs, the gpus argument provided to the command line interface contains the list of GPUs to use; if no gpus argument is provided, all available GPUs will be used.
The gpu_fraction argument can be provided as well, but it is adjusted according to num_workers so that tasks can be executed in parallel.
For example, if num_workers: 4 and 2 GPUs are available, a provided gpu_fraction above 0.5 will be replaced by 0.5.
An epsilon parameter (default: 0.01) is also provided to allow for additional free GPU memory: the GPU fraction to use is defined as (#gpus / #workers) - epsilon.
Example:
executor:
  type: parallel
  num_workers: 2
  epsilon: 0.01
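As a worked example of the formula above, using the values from the prose example in this section (2 available GPUs, 4 workers, the default epsilon):

# GPU fraction computation described above: (#gpus / #workers) - epsilon.
# Values are taken from the example in this section.
num_gpus = 2
num_workers = 4
epsilon = 0.01

gpu_fraction = (num_gpus / num_workers) - epsilon
print(gpu_fraction)  # 0.49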
Ray Tune Executor
The ray executor is used in conjunction with the ray sampler to enable Ray Tune for distributed hyperopt across a cluster of machines.
Parameters:
- cpu_resources_per_trial: The number of CPU cores allocated to each trial (default: 1).
- gpu_resources_per_trial: The number of GPU devices allocated to each trial (default: 0).
- kubernetes_namespace: When running on Kubernetes, provide the namespace of the Ray cluster to sync results between pods. See the Ray docs for more info.
Example:
executor:
  type: ray
  cpu_resources_per_trial: 2
  gpu_resources_per_trial: 1
  kubernetes_namespace: ray
Running Ray Executor:
See the section on Running Ludwig with Ray for guidance on setting up your Ray cluster.
Fiber Executor
Fiber is a Python distributed computing library for modern computer clusters.
The fiber executor performs hyper-parameter optimization in parallel on a computer cluster so that massive parallelism can be achieved.
Check Fiber's documentation for the supported cluster types.
Fiber Executor requires fiber to be installed:
pip install fiber
Parameters:
- num_workers: The number of parallel workers that are used to train and evaluate models. The default value is 2.
- num_cpus_per_worker: How many CPU cores are allocated per worker.
- num_gpus_per_worker: How many GPUs are allocated per worker.
- fiber_backend: The Fiber backend to use. This needs to be set if you want to run hyper-parameter optimization on a cluster. The default value is local. Available values are local, kubernetes and docker. Check Fiber's documentation for details on the supported platforms.
Example:
executor:
  type: fiber
  num_workers: 10
  fiber_backend: kubernetes
  num_cpus_per_worker: 2
  num_gpus_per_worker: 1
Running Fiber Executor:
Fiber runs on a computer cluster and uses Docker to encapsulate all the code and dependencies. To run a hyper-parameter search powered by Fiber, you have to create a Dockerfile that encapsulates your code and dependencies.
Example Dockerfile:
FROM tensorflow/tensorflow:1.15.2-gpu-py3
RUN apt-get -y update && apt-get -y install git libsndfile1
RUN git clone --depth=1 https://github.com/ludwig-ai/ludwig.git
RUN cd ludwig/ \
    && pip install -r requirements.txt -r requirements_text.txt \
        -r requirements_image.txt -r requirements_audio.txt \
        -r requirements_serve.txt -r requirements_viz.txt \
    && python setup.py install
RUN pip install fiber
RUN mkdir /data
ADD train.csv /data/data.csv
ADD hyperopt.yaml /data/hyperopt.yaml
WORKDIR /data
In this Dockerfile, the dataset data.csv is embedded in the Docker image together with hyperopt.yaml, which specifies the model and the hyper-parameter optimization parameters.
If your data is too big to be added directly to the Docker image, refer to Fiber's documentation for instructions on how to work with shared persistent storage for Fiber workers.
An example hyperopt.yaml looks like:
input_features:
  -
    name: x
    type: numerical
output_features:
  -
    name: y
    type: category
training:
  epochs: 1
hyperopt:
  sampler:
    type: random
    num_samples: 50
  executor:
    type: fiber
    num_workers: 10
    fiber_backend: kubernetes
    num_cpus_per_worker: 2
    num_gpus_per_worker: 1
  parameters:
    training.learning_rate:
      type: float
      low: 0.0001
      high: 0.1
    y.num_fc_layers:
      type: int
      low: 0
      high: 2
Running hyper-parameter optimization with Fiber is a little different from using other executors because building and pushing a Docker image is involved. The fiber run command, which takes care of those aspects, is used to run hyper-parameter optimization on a cluster:
fiber run ludwig hyperopt --dataset train.csv -cf hyperopt.yaml
Check out Fiber's documentation for more details on running on clusters.
Full hyper-parameter optimization example
Example YAML:
input_features:
  -
    name: utterance
    type: text
    encoder: rnn
    cell_type: lstm
    num_layers: 2
  -
    name: section
    type: category
    representation: dense
    embedding_size: 100
combiner:
  type: concat
  num_fc_layers: 1
output_features:
  -
    name: class
    type: category
preprocessing:
  text:
    word_vocab_size: 10000
training:
  learning_rate: 0.001
  optimizer:
    type: adam
hyperopt:
  goal: maximize
  output_feature: class
  metric: accuracy
  split: validation
  parameters:
    training.learning_rate:
      type: float
      low: 0.0001
      high: 0.1
      steps: 4
      scale: log
    training.optimizer.type:
      type: category
      values: [sgd, adam, adagrad]
    preprocessing.text.word_vocab_size:
      type: int
      low: 700
      high: 1200
      steps: 5
    combiner.num_fc_layers:
      type: int
      low: 1
      high: 5
    utterance.cell_type:
      type: category
      values: [rnn, gru, lstm]
  sampler:
    type: random
    num_samples: 12
  executor:
    type: parallel
    num_workers: 4
Example CLI command:
ludwig hyperopt --dataset reuters-allcats.csv --config "{input_features: [{name: utterance, type: text, encoder: rnn, cell_type: lstm, num_layers: 2}], output_features: [{name: class, type: category}], training: {learning_rate: 0.001}, hyperopt: {goal: maximize, output_feature: class, metric: accuracy, split: validation, parameters: {training.learning_rate: {type: float, low: 0.0001, high: 0.1, steps: 4, scale: log}, utterance.cell_type: {type: category, values: [rnn, gru, lstm]}}, sampler: {type: grid}, executor: {type: serial}}}"