Ray Serve and KServe Deployment¶
Ludwig models can be deployed to production with:
- Ray Serve — autoscaling multi-replica deployment on a Ray cluster
- KServe — Kubernetes-native serving via the Open Inference Protocol v2
Ray Serve exposes the same JSON record payload as the built-in FastAPI server, so your
client code works unchanged between local and Ray Serve deployments. KServe instead
accepts requests in the Open Inference Protocol v2 format (see the examples below).
For general serving options (local FastAPI, vLLM) see Serving Ludwig Models.
Ray Serve¶
Install¶
pip install "ludwig[distributed]" # includes ray[serve]
Deploy¶
import ray
from ray import serve
from ludwig.serve_ray_serve import deploy_ludwig_model

ray.init(ignore_reinit_error=True)

handle = deploy_ludwig_model(
    model_path="results/experiment_run/model",
    name="sentiment",
    num_replicas=2,
    ray_actor_options={"num_gpus": 1},  # omit for CPU
)

print("Deployment live at: http://localhost:8000/sentiment")
Alternatively, launch the same deployment from the command line with the bundled example script:

# Two GPU replicas on port 8080
python examples/serving/ray_serve/deploy.py \
    --model_path ./results/my_model \
    --name sentiment \
    --num_replicas 2 \
    --gpu \
    --port 8080 \
    --block
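Either way, you can confirm the deployment is healthy from Python with Ray Serve's status API. A minimal sketch; it assumes you run it on (or connected to) the same Ray cluster:

```python
from ray import serve

# Reports each running Serve application and its overall state.
status = serve.status()
for name, application in status.applications.items():
    print(name, application.status)
```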
Send predictions¶
# Single record
curl -X POST http://localhost:8000/sentiment \
    -H "Content-Type: application/json" \
    -d '{"text": "I love this product!", "stars": 5}'

# Batch (list of records)
curl -X POST http://localhost:8000/sentiment \
    -H "Content-Type: application/json" \
    -d '[{"text": "great"}, {"text": "terrible"}]'
Python client:
import httpx

url = "http://localhost:8000/sentiment"

# Single record
response = httpx.post(url, json={"text": "Amazing product, highly recommend!"})
print(response.json())

# Batch
records = [{"text": "great"}, {"text": "terrible"}, {"text": "okay"}]
response = httpx.post(url, json=records)
print(response.json())
Autoscaling¶
Ray Serve supports autoscaling out of the box. Build the deployment with
make_ludwig_deployment_class and pass an autoscaling_config through .options():
from ray import serve
from ray.serve.config import AutoscalingConfig
from ludwig.serve_ray_serve import make_ludwig_deployment_class

DeploymentClass = make_ludwig_deployment_class(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1},
)

# Autoscale between 1 and 10 replicas based on queue length
app = DeploymentClass.options(
    autoscaling_config=AutoscalingConfig(
        min_replicas=1,
        max_replicas=10,
        target_num_ongoing_requests_per_replica=5,
    )
).bind("results/experiment_run/model")

serve.run(app, name="sentiment")
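To see the autoscaler react, drive sustained concurrent load against the endpoint and watch replica counts with `serve status` on the CLI or in the Ray dashboard. A minimal load-generation sketch; the URL assumes the /sentiment route from the first deployment example, so adjust it to your app's route prefix:

```python
import concurrent.futures

import httpx

URL = "http://localhost:8000/sentiment"  # adjust to your deployment's route prefix

def send(i: int) -> int:
    response = httpx.post(URL, json={"text": f"sample request {i}"}, timeout=30.0)
    return response.status_code

# Keep well over 5 requests in flight per replica so the autoscaler adds replicas.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    codes = list(pool.map(send, range(500)))

print({code: codes.count(code) for code in set(codes)})
```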
Traffic splitting (A/B testing)¶
from ray import serve
from ludwig.serve_ray_serve import make_ludwig_deployment_class

ModelV1 = make_ludwig_deployment_class(num_replicas=1)
ModelV2 = make_ludwig_deployment_class(num_replicas=1)

app_v1 = ModelV1.bind("results/model_v1")
app_v2 = ModelV2.bind("results/model_v2")

# Deploy each version as its own Serve application under a separate route prefix
serve.run(app_v1, name="sentiment_v1", route_prefix="/v1")
serve.run(app_v2, name="sentiment_v2", route_prefix="/v2")
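Ray Serve does not weight traffic between the two prefixes by itself, so the 80/20 split has to be made by whatever sits in front of them, e.g. an upstream load balancer or the client. A client-side sketch (the URLs and weights here are illustrative, not part of Ludwig's API):

```python
import random

import httpx

ROUTES = [
    ("http://localhost:8000/v1", 0.8),  # 80% of traffic to model v1
    ("http://localhost:8000/v2", 0.2),  # 20% of traffic to model v2
]

def predict(record: dict) -> dict:
    # Pick a route according to the configured weights, then forward the record.
    url = random.choices([r for r, _ in ROUTES], weights=[w for _, w in ROUTES])[0]
    response = httpx.post(url, json=record)
    response.raise_for_status()
    return response.json()

print(predict({"text": "solid product, would buy again"}))
```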
KServe¶
KServe implements the Open Inference Protocol v2 for Kubernetes-native ML serving.
Install¶
pip install kserve
Serve locally¶
python -m ludwig.serve_kserve \
    --model_name sentiment \
    --model_path results/experiment_run/model
Or with Python:
from ludwig.serve_kserve import create_kserve_model_server

server = create_kserve_model_server(
    model_name="sentiment",
    model_path="results/experiment_run/model",
)
server.start([server.model])
Send v2 protocol requests¶
# Health check
curl http://localhost:8080/v2/health/live
# Model metadata
curl http://localhost:8080/v2/models/sentiment
# Inference (v2 input format)
curl -X POST http://localhost:8080/v2/models/sentiment/infer \
    -H "Content-Type: application/json" \
    -d '{
        "inputs": [
            {"name": "text", "shape": [1], "datatype": "BYTES", "data": ["I love this!"]},
            {"name": "stars", "shape": [1], "datatype": "INT32", "data": [5]}
        ]
    }'
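The same v2 request can be sent from Python. This sketch simply builds the payload by hand and posts it with httpx; any HTTP client works, and the output tensor names depend on your model's features:

```python
import httpx

payload = {
    "inputs": [
        {"name": "text", "shape": [1], "datatype": "BYTES", "data": ["I love this!"]},
        {"name": "stars", "shape": [1], "datatype": "INT32", "data": [5]},
    ]
}

response = httpx.post(
    "http://localhost:8080/v2/models/sentiment/infer",
    json=payload,
    timeout=30.0,
)
response.raise_for_status()

# A v2 response mirrors the request: a list of named output tensors.
for output in response.json()["outputs"]:
    print(output["name"], output["data"])
```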
Deploy on Kubernetes with KServe¶
Create an InferenceService manifest:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ludwig-sentiment
spec:
  predictor:
    containers:
      - name: predictor
        image: ludwigai/ludwig:latest
        command: ["python", "-m", "ludwig.serve_kserve"]
        args:
          - --model_name=sentiment
          - --model_path=/mnt/models
        volumeMounts:
          - name: model-volume
            mountPath: /mnt/models
    volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: ludwig-models-pvc
Apply the manifest and check that the service comes up:

kubectl apply -f inference_service.yaml
kubectl get inferenceservice ludwig-sentiment
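If you use the kserve Python SDK, you can also wait for the InferenceService to become ready programmatically. A hedged sketch, assuming KServeClient and its get(..., watch=True) call behave as in current kserve releases; the namespace is illustrative:

```python
from kserve import KServeClient

client = KServeClient()

# Watch the InferenceService until it reports Ready (or the timeout expires).
client.get(
    "ludwig-sentiment",
    namespace="default",
    watch=True,
    timeout_seconds=180,
)
```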
Choosing between serving options¶
| Option | Scale | Protocol | Best for |
|---|---|---|---|
| `ludwig serve` (FastAPI) | Single process | REST JSON | Development, simple APIs |
| Ray Serve | Multi-replica, autoscaling | REST JSON | Production on Ray clusters |
| KServe | Kubernetes-native | OIP v2 + REST | Enterprise Kubernetes |
| `ludwig serve` (vLLM) | GPU-optimized | OpenAI-compatible | LLM serving |
See also¶
- Serving Ludwig Models — FastAPI and vLLM serving
- Distributed training on Ray — train on a Ray cluster
- Model Export — export to ONNX or TorchScript before serving