Backend

The same Ludwig config / Python code that runs on your local machine can be executed remotely in a distributed manner with zero code changes. This distributed execution includes preprocessing, training, and batch prediction.

In most cases, Ludwig will be able to automatically detect whether you're running in an environment that supports distributed execution, but you can also make this explicit, either with the --backend command line arg or by providing a backend section in the Ludwig config YAML:

backend:
  type: ray
  cache_dir: s3://my_bucket/cache
  cache_credentials: /home/user/.credentials.json
  processor:
    type: dask
  trainer:
    strategy: horovod
  loader: {}
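
The same backend can also be selected programmatically. The sketch below assumes the Python API's LudwigModel constructor accepts a backend argument (string or dict) mirroring the YAML section above; the config and dataset paths are placeholders:

from ludwig.api import LudwigModel

# Backend settings mirroring the YAML `backend` section above (paths are placeholders).
backend = {
    "type": "ray",
    "cache_dir": "s3://my_bucket/cache",
    "processor": {"type": "dask"},
    "trainer": {"strategy": "horovod"},
}

# Assumes LudwigModel accepts a backend dict alongside the model config.
model = LudwigModel(config="config.yaml", backend=backend)
model.train(dataset="s3://my_bucket/dataset.parquet")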

Parameters:

  • type: How the job will be distributed, one of local, ray, horovod.
  • cache_dir: Where the preprocessed data will be written on disk; defaults to the location of the input dataset. See Cloud Storage for more details.
  • cache_credentials: Optional dictionary of credentials (or path to a credentials JSON file) used to write to the cache. See Cloud Storage for more details.
  • processor: (Ray only) parameters to configure execution of distributed data processing.
  • trainer: (Ray only) parameters to configure execution of distributed training.
  • loader: (Ray only) parameters to configure data loading from processed data to training batches.

Processor

The processor section configures distributed data processing. The local backend uses the Pandas dataframe library, which runs in a single process with the entire dataset in memory. To make data processing scalable to large datasets, we support two distributed dataframe libraries with the ray backend:

  • dask: (default) a lazily executed version of distributed Pandas.
  • modin: an eagerly executed version of distributed Pandas.

Dask

Dask is the default distributed data processing library when using Ludwig on Ray. It executes distributed Pandas operations on partitions of the data in parallel. One beneficial property of Dask is that it is executed lazily, which allows it to stream very large datasets without needing to hold the entire dataset in distributed memory at once.

One downside to Dask is that it can require some tuning to get the best performance. There are two knobs we expose in Ludwig for tuning Dask:

  • parallelism: the number of partitions to divide the dataset into (defaults to letting Dask figure this out automatically).
  • persist: whether intermediate stages of preprocessing should be cached in distributed memory (default: true).

Increasing parallelism can reduce memory pressure when preprocessing large datasets and improve horizontal scaling, since more partitions can be processed concurrently. The downside of too much parallelism is that every partition-level operation carries some overhead (serialization and deserialization), which can come to dominate the runtime if the partition count is set too high.

Setting persist to false can be useful if the dataset is too large to fit in the combined memory and disk of the Ray cluster. Only set this to false if you're seeing issues running out of memory or disk space.
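
To build intuition for what the parallelism knob controls, here is a small sketch using dask.dataframe directly, outside of Ludwig; the dataset path is a placeholder:

import dask.dataframe as dd

# Read a Parquet dataset lazily; Dask splits it into partitions (placeholder path).
df = dd.read_parquet("s3://my_bucket/dataset.parquet")
print(df.npartitions)  # current number of partitions

# Repartitioning is roughly what Ludwig's parallelism setting controls: more
# partitions means smaller chunks that can be processed in parallel, at the cost
# of some per-partition overhead.
df = df.repartition(npartitions=100)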

Example:

backend:
  type: ray
  processor:
    type: dask
    parallelism: 100
    persist: true

Modin

Modin is an eagerly-executed distributed dataframe library that closely mirrors the behavior of Pandas. Because it behaves almost identically to Pandas but is able to distribute the dataset across the Ray cluster, there are fewer things to configure to optimize its performance.
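
As an illustration of how closely Modin mirrors Pandas, here is a sketch using Modin directly, outside of Ludwig; the path is a placeholder:

import modin.pandas as pd  # drop-in replacement for the pandas import

# The same calls you would make with pandas work here, but execution is
# distributed across the Ray cluster.
df = pd.read_parquet("s3://my_bucket/dataset.parquet")
print(df.head())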

Support for Modin is currently experimental.

Example:

backend:
  type: ray
  processor:
    type: modin

Trainer

The trainer section configures the distributed training strategy. Currently we support the following strategies (described in detail below), with more coming in the future:

  • Horovod
  • Distributed Data Parallel (DDP)
  • Fully Sharded Data Parallel (FSDP)

The following parameters can be configured for any distributed strategy:

  • strategy: one of horovod, ddp, or fsdp.
  • use_gpu: whether to use GPUs for training (defaults to true when the cluster has at least one GPU).
  • num_workers: how many distributed workers to use for training (defaults to the number of GPUs, or 1 if no GPUs are found).
  • resources_per_worker: the Ray resources to assign to each worker (defaults to 1 CPU and, if available, 1 GPU per worker).
  • logdir: path to the file directory where logs should be persisted.
  • max_retries: number of retries when Ray actors fail (defaults to 3).

See the Ray Train API for more details on these parameters.

Example:

backend:
  type: ray
  trainer:
    strategy: horovod
    use_gpu: true
    num_workers: 4
    resources_per_worker:
      CPU: 2
      GPU: 1

In most cases, you shouldn't need to set these values explicitly, as Ludwig will set them on your behalf to maximize the resources available in the cluster. There are two cases in which it makes sense to set these values explicitly, however:

  • Your Ray cluster makes use of autoscaling (in which case Ludwig cannot know how many GPUs will be available until the cluster scales up).
  • Your Ray cluster is running multiple workloads at once (in which case you may wish to reserve resources for other jobs).

Horovod

Horovod is a distributed data-parallel framework that is optimized for bandwidth-constrained computing environments. It makes use of Nvidia's NCCL for fast GPU-to-GPU communication.

Example:

backend:
  type: ray
  trainer:
    strategy: horovod

Distributed Data Parallel (DDP)

Distributed Data Parallel or DDP is PyTorch's native data-parallel library that functions very similarly to Horovod, but does not require installing any additional packages.

In benchmarks, we found DDP and Horovod to perform near identically, so if you're not already using Horovod, DDP is the easiest way to get started with distributed training in Ludwig.

backend:
  type: ray
  trainer:
    strategy: ddp

Fully Sharded Data Parallel (FSDP)

Fully Sharded Data Parallel or FSDP is PyTorch's native data-parallel + model-parallel library for training very large models whose parameters are spread across multiple GPUs.

The primary scenario to use FSDP is when the model you're training is too large to fit into a single GPU (e.g., fine-tuning a large language model like BLOOM). When the model is small enough to fit on a single GPU, however, benchmarking has shown it's generally better to use Horovod or DDP.

backend:
  type: ray
  trainer:
    strategy: fsdp
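
To see why a model like BLOOM cannot fit on a single device, here is a rough back-of-the-envelope calculation (assuming fp16 weights and ignoring activations, gradients, and optimizer state):

# BLOOM has roughly 176 billion parameters; at 2 bytes per fp16 weight the
# parameters alone take ~352 GB, far more than any single GPU's memory.
num_parameters = 176_000_000_000
bytes_per_param = 2  # fp16
print(f"{num_parameters * bytes_per_param / 1e9:.0f} GB just for the weights")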

Loader

The loader section configures the "last mile" data ingest from processed data (typically cached in the Parquet format) to tensor batches used for training the model.

When training a deep learning model at scale, the chain is only as strong as its weakest link: the data loading pipeline needs to be at least as fast as the GPU forward / backward passes; otherwise, the whole process will be bottlenecked by the data loader.

In most cases, Ludwig's defaults will be sufficient to get good data loading performance, but if you notice GPU utilization dropping even after scaling up the batch size, or utilization sitting at 0% for long periods before spiking again, you may need to tune the loader parameters below to improve throughput:

  • fully_executed: Force full evaluation of the preprocessed dataset by loading all blocks into cluster memory / storage (defaults to true). Disable this if the dataset is much larger than the total amount of cluster memory allocated to the Ray object store and you notice that object spilling is occurring frequently during training.
  • window_size_bytes: Load and shuffle the preprocessed dataset in discrete windows of this size (defaults to null, meaning data will not be windowed). Try configuring this if shuffling is taking a very long time, indicated by every epoch of training taking many minutes to start. In general, larger window sizes result in more uniform shuffling (which can lead to better model performance in some cases), while smaller window sizes will be faster to load. This setting is particularly useful when running hyperopt over a large dataset.

Example:

backend:
  type: ray
  loader:
    fully_executed: false
    window_size_bytes: 500000000