Cloud Storage
Cloud object storage systems like Amazon S3 are useful when working with large datasets or when running distributed training on a cluster of machines. Ludwig provides out-of-the-box support for reading from and writing to remote filesystems through fsspec.
Example:
ludwig train \
--dataset s3://my_datasets/subdir/dataset.parquet \
--output_directory s3://my_experiments/foo
Environment Setup¶
The sections below cover how to read from and write to your preferred remote filesystem in Ludwig.
Amazon S3¶
Install the filesystem driver in your Docker image: pip install s3fs.
Mount your $HOME/.aws/credentials file into the container or set the following environment variables:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
Refer to paths with protocol s3://.
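For example, a minimal sketch of supplying S3 credentials to the Ludwig container. The image name my-ludwig-image is a placeholder for whichever image you built with s3fs installed, and the mount path assumes the container's home directory is /root:

# Option 1: mount the AWS credentials file into the container
docker run --rm \
  -v $HOME/.aws/credentials:/root/.aws/credentials:ro \
  my-ludwig-image \
  ludwig train \
    --dataset s3://my_datasets/subdir/dataset.parquet \
    --output_directory s3://my_experiments/foo

# Option 2: forward the credentials from the host as environment variables
docker run --rm \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  my-ludwig-image \
  ludwig train \
    --dataset s3://my_datasets/subdir/dataset.parquet \
    --output_directory s3://my_experiments/foo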
MinIO¶
MinIO uses the same s3:// protocol as Amazon S3, but requires an additional environment variable to be set:
AWS_ENDPOINT_URL
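For example, to point the S3 driver at a local MinIO deployment (the endpoint URL and credentials below are placeholders):

# Placeholder endpoint and credentials for a MinIO server
export AWS_ENDPOINT_URL=http://minio.example.com:9000
export AWS_ACCESS_KEY_ID=my_minio_access_key
export AWS_SECRET_ACCESS_KEY=my_minio_secret_key

ludwig train \
  --dataset s3://my_datasets/subdir/dataset.parquet \
  --output_directory s3://my_experiments/foo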
Google Cloud Storage¶
Install the filesystem driver in your Docker image: pip install gcsfs.
Generate a token as described here.
Mount the token file into the container at one of the locations described in the gcsfs docs.
Refer to paths with protocol gs:// or gcs://.
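For example, a sketch of mounting a previously generated token into the container, assuming it was saved as the gcloud application default credentials file (one of the default locations gcsfs checks) and that the container's home directory is /root; my-ludwig-image is a placeholder:

# Mount the token file read-only at the standard gcloud location inside the container
docker run --rm \
  -v $HOME/.config/gcloud/application_default_credentials.json:/root/.config/gcloud/application_default_credentials.json:ro \
  my-ludwig-image \
  ludwig train \
    --dataset gs://my_datasets/subdir/dataset.parquet \
    --output_directory gs://my_experiments/foo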
Azure Storage¶
Install the filesystem driver in your Docker image: pip install adlfs.
Set the following environment variable:
AZURE_STORAGE_CONNECTION_STRING
See the adlfs docs for more details.
Refer to paths with protocol az:// or abfs://.
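For example, a sketch of forwarding the connection string from the host into the container (my-ludwig-image and the container/paths are placeholders):

# Forward the Azure connection string from the host environment
docker run --rm \
  -e AZURE_STORAGE_CONNECTION_STRING \
  my-ludwig-image \
  ludwig train \
    --dataset az://my_container/dataset.parquet \
    --output_directory az://my_container/experiments/foo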
Additional Configuration¶
Remote Dataset Cache¶
Often your input datasets will live in a read-only location such as a shared data lake. In these cases, you won't want to rely on Ludwig's default caching behavior of writing to the same base directory as the input dataset. Instead, you can point Ludwig at a dedicated cache directory or bucket through the backend section of the config:
backend:
  cache_dir: "s3://ludwig_cache"
Individual entries will be written using a filename computed from the checksum of the dataset and Ludwig config used for training.
One additional benefit of a dedicated cache is the ability to apply cache eviction policies, for example a TTL so cached datasets are automatically cleaned up after a few days.
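On S3, for instance, a TTL can be implemented as a bucket lifecycle rule. A rough sketch using the AWS CLI against the cache bucket from the example above (the rule ID and the 7-day expiration are arbitrary placeholder choices):

# Write a lifecycle rule that expires every object in the bucket after 7 days
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-ludwig-cache",
      "Filter": {"Prefix": ""},
      "Status": "Enabled",
      "Expiration": {"Days": 7}
    }
  ]
}
EOF

# Apply the rule to the cache bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket ludwig_cache \
  --lifecycle-configuration file://lifecycle.json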
Using different cache and dataset filesystems¶
In some cases you may want your dataset cache to reside in a different filesystem or account than your input dataset. Since this requires maintaining two sets of credentials (one for the input data and one for the cache), Ludwig provides additional configuration options for the cache credentials.
Credentials can be provided explicitly in the config:
backend:
  cache_dir: "s3://ludwig_cache"
  cache_credentials:
    s3:
      client_kwargs:
        aws_access_key_id: "test"
        aws_secret_access_key: "test"
Or in a mounted file for additional security:
backend:
  cache_dir: "s3://ludwig_cache"
  cache_credentials: /home/user/.credentials.json
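The mounted file then supplies the same mapping as the inline form. A sketch of its contents, assuming the JSON mirrors the cache_credentials structure shown above:

{
  "s3": {
    "client_kwargs": {
      "aws_access_key_id": "test",
      "aws_secret_access_key": "test"
    }
  }
}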