Distributed training on Ray
Ludwig has strong support for Ray, a framework for distributed computing that makes it easy to scale up code that runs on your local machine to execute in parallel across a cluster of machines.
Let's spin up a Ray cluster so we can try out distributed training and hyperparameter tuning in parallel. This walkthrough uses AWS EC2 as the node provider, so make sure you have an AWS account and credentials that can launch EC2 instances.
First install the Ray Cluster Launcher:
pip install ray
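If you're launching on AWS, the cluster launcher also needs the AWS SDK to provision EC2 instances, so it's worth installing boto3 alongside Ray (an assumption based on Ray's standard AWS setup):

pip install boto3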
Next, let's create a configuration file named cluster.yaml for the Ray cluster:
cluster_name: ludwig-ray-gpu-latest
min_workers: 4
max_workers: 4
docker:
    image: "ludwigai/ludwig-ray-gpu:latest"
    container_name: "ray_container"
head_node:
    InstanceType: m5.2xlarge
    ImageId: latest_dlami
worker_nodes:
    InstanceType: g4dn.2xlarge
    ImageId: latest_dlami
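Depending on your Ray version, the cluster launcher may also require a provider section telling it which cloud and region to launch into. A minimal sketch for AWS (the region here is a placeholder; adjust it to your account):

provider:
    type: aws
    region: us-west-2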
Finally, you can spin up the cluster with the following command:
ray up cluster.yaml
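While the cluster is running, a few companion commands from the Ray CLI are handy; each takes the same cluster.yaml:

# Stream the autoscaler logs to watch nodes come up
ray monitor cluster.yaml
# Open an SSH session on the head node
ray attach cluster.yaml
# Tear the cluster down when you're done
ray down cluster.yaml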
To run a distributed training job, make sure your dataset is stored in an S3 bucket, then run this command:
ray submit cluster.yaml ludwig train --config rotten_tomatoes.yaml --dataset s3://mybucket/rotten_tomatoes.csv
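The command above assumes a Ludwig config file named rotten_tomatoes.yaml exists next to where you run it. As a rough sketch of what it might contain, here is a minimal config declaring one text input and one binary output; the column names review_content and recommended are assumptions about the dataset, not taken from it:

input_features:
    - name: review_content   # hypothetical text column holding the review
      type: text
output_features:
    - name: recommended      # hypothetical binary label column
      type: binary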
You can also run a distributed hyperopt job with this command:
ray submit cluster.yaml ludwig hyperopt --config rotten_tomatoes.yaml --dataset s3://mybucket/rotten_tomatoes.csv
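For the hyperopt command to do anything useful, rotten_tomatoes.yaml also needs a hyperopt section. Here is a minimal sketch, assuming the Ray Tune-style search space syntax of recent Ludwig releases; the parameter name, bounds, and sample count are illustrative, not prescribed:

hyperopt:
    goal: minimize
    metric: loss
    output_feature: recommended
    parameters:
        trainer.learning_rate:   # search the learning rate on a log scale
            space: loguniform
            lower: 0.0001
            upper: 0.1
    executor:
        type: ray                # run trials in parallel across the cluster
        num_samples: 10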
For more information on using Ray with Ludwig, refer to the Ray configuration guide.