Run Tests on GPU Using Ray

As part of Ludwig's GitHub Actions PR checks, all Ludwig tests must pass both with and without GPU availability.

To debug a specific test on GPU, it may be useful to run Ludwig GPU tests using Ray.


1. Set up an AWS AMI with a GPU

Reach out to your AWS account administrator, or set up an account for yourself.

2. Test if you have the AWS CLI

aws s3 ls

If the command is not found, install the AWS CLI by following the official AWS CLI installation guide.
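Note that aws s3 ls can also fail because credentials are not configured yet (that is step 3), so it may help to check for the binary separately. A minimal sketch, assuming a POSIX shell; the helper name have_cmd is made up for this example:

```shell
# Succeeds if the named command can be found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd aws; then
  # aws --version prints the installed CLI version.
  echo "AWS CLI found: $(aws --version 2>&1)"
else
  echo "AWS CLI not found; install it before continuing."
fi
```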

3. Set up AWS keys

  1. AWS Credentials [you will need to set this up for Ray to authenticate you]

    How to create AWS Access Key ID

    Once created, download your access key so you can refer to it.

  2. Run aws configure to configure your AWS CLI with your access credentials

    Configuration and credential file settings - AWS Command Line Interface

  3. (optional) Get an AWS PEM file

    Not needed for unit tests on GPU, which never spin up new nodes, but you will need it if you ever want Ray to launch new nodes.

    Amazon EC2 key pairs and Linux instances - Amazon Elastic Compute Cloud
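Once aws configure has been run, one way to confirm that the credentials work is to ask AWS who you are. The sketch below wraps aws sts get-caller-identity in a small function; the function name check_aws_identity is an assumption of this example, not part of any tool:

```shell
# Prints the account and identity behind the configured credentials.
# Returns 127 if the aws CLI is not installed; otherwise returns
# whatever exit status the aws command itself reports.
check_aws_identity() {
  if ! command -v aws >/dev/null 2>&1; then
    echo "aws CLI is not installed" >&2
    return 127
  fi
  aws sts get-caller-identity
}

check_aws_identity || echo "Credentials are not usable yet; re-run aws configure."
```

If the call succeeds, Ray should be able to authenticate with the same credentials, since boto3 reads the same configuration files.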

4. Get Ray

Install Ray locally:

pip install -U "ray[default]" boto3
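To confirm the install landed in the Python environment you expect, you can try importing both packages. This is a rough sketch that assumes the interpreter is available as python3; adjust the name if your environment uses python instead. The helper check_py_module is hypothetical:

```shell
# Succeeds if python3 can import the given module.
check_py_module() {
  python3 -c "import $1" >/dev/null 2>&1
}

for mod in ray boto3; do
  if check_py_module "$mod"; then
    echo "$mod: importable"
  else
    echo "$mod: NOT importable"
  fi
done
```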

5. Set up a Ray Config

vim $HOME/.clusters/cluster.yaml

Copy the sample ray config below and edit all the <...> values to match your local dev environment.

cluster_name: <$USER>-ludwig-ray-g4dn

max_workers: 3

docker:
  image: "ludwigai/ludwig-ray-gpu:master"
  container_name: "ray_container"
  pull_before_run: True
  run_options: # Extra options to pass into "docker run"
    - --ulimit nofile=65536:65536

provider:
  type: aws
  region: <us-east-2>
  availability_zone: <us-east-2a>

available_node_types:
  ray.head.default:
    resources: {}
    node_config:
      InstanceType: g4dn.4xlarge
      ImageId: latest_dlami
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 100
  ray.worker.default:
    min_workers: 0
    max_workers: 0
    resources: {}
    node_config:
      InstanceType: g4dn.4xlarge
      ImageId: latest_dlami

head_node_type: ray.head.default

file_mounts: {
    /home/ubuntu/ludwig/: </Users/$USER/ludwig>,  # Ludwig Repo.
    /home/ray/.aws: </Users/$USER/.aws>,  # AWS credentials.
}

rsync_exclude:
  - "**/.git"
  - "**/.git/**"

rsync_filter:
  - ".gitignore"

setup_commands:
  - pip uninstall -y ludwig && pip install -e /home/ubuntu/ludwig/.
  - pip install s3fs==2021.10.0 aiobotocore==1.4.2 boto3==1.17.106
  - pip install pandas==1.1.4
  - pip install hydra-core --upgrade

head_start_ray_commands:
  - ray stop --force
  - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
  - ray stop --force
  - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Set an environment variable pointing to the location of your cluster config (the path can be relative):

export CLUSTER="$HOME/.clusters/cluster.yaml"
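Before launching anything, it can be worth checking that the file exists and that no <...> placeholder from the sample config was left unedited. The following is a sketch only; both function names are invented for this example, and the pattern will also flag any legitimate text that happens to sit between angle brackets:

```shell
# Succeeds (and prints the matching lines) if the file still
# contains <...> placeholders.
find_placeholders() {
  grep -n '<[^>]*>' "$1"
}

# Returns 1 if the config is missing, 2 if placeholders remain,
# and 0 (printing "ok: <path>") otherwise.
check_cluster_config() {
  if [ ! -f "$1" ]; then
    echo "missing: $1" >&2
    return 1
  fi
  if find_placeholders "$1" >&2; then
    echo "unfilled placeholders in $1" >&2
    return 2
  fi
  echo "ok: $1"
}

# Falls back to the path used earlier if CLUSTER is not set.
check_cluster_config "${CLUSTER:-$HOME/.clusters/cluster.yaml}" || true
```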

Developer Workflow

(once) Launch the ray cluster

export CLUSTER="$HOME/cluster_g4dn.yaml"
export CLUSTER_CPU="$HOME/cluster_cpu.yaml"
ray up $CLUSTER

Make local changes

Run tests locally.

pytest tests/...

Rsync your local changes to the ray GPU cluster

ray rsync_up $CLUSTER -A "/Users/$USER/ludwig/" '/home/ubuntu/ludwig'
ray rsync_up $CLUSTER_CPU -A "/Users/$USER/ludwig/" '/home/ubuntu/ludwig'


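The trailing slash (/) on the local source path is important: it tells rsync to copy the directory's contents rather than the directory itself.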
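If you script the sync step, a tiny helper can guarantee that the source path ends with /. This is a sketch under the assumption that your checkout lives at $HOME/ludwig; the helper name with_trailing_slash is hypothetical, and the block only prints the command rather than running it:

```shell
# Prints the given path with exactly one "/" appended, whether or
# not the input already ended with a single slash.
with_trailing_slash() {
  printf '%s/\n' "${1%/}"
}

src=$(with_trailing_slash "$HOME/ludwig")
# Show the sync command that would be run against the GPU cluster.
echo "ray rsync_up \$CLUSTER -A \"$src\" /home/ubuntu/ludwig"
```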

Run tests on the GPU cluster from the Ray-mounted ludwig directory

ray exec $CLUSTER "cd /home/ubuntu/ludwig && pytest tests/"
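When debugging one specific test, running the whole suite is usually unnecessary; any pytest arguments can be passed through the same ray exec call. The sketch below builds the remote command string. The function name gpu_pytest_cmd and the test file path are placeholders made up for this example, so substitute a real test from the repository:

```shell
# Builds the remote command: change into the mounted Ludwig repo,
# then run pytest with whatever arguments were given.
gpu_pytest_cmd() {
  printf 'cd /home/ubuntu/ludwig && pytest %s' "$*"
}

# Hypothetical test file and -k filter, shown for illustration only.
remote_cmd=$(gpu_pytest_cmd tests/integration_tests/test_example.py -k gpu)
echo "ray exec \$CLUSTER \"$remote_cmd\""
```

The block prints the command instead of executing it, so you can inspect it first and then paste it into your terminal.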

You can also connect directly to a terminal on the cluster head:

ray attach $CLUSTER