Serving Ludwig Models¶

Ludwig provides four serving options: a general-purpose REST API server for ECD models, an OpenAI-compatible vLLM server for LLMs, Ray Serve for multi-replica deployments, and KServe for Kubernetes.

REST API Server¶

Starting the server¶

ludwig serve --model_path=/path/to/model

This starts a FastAPI server with the following endpoints:

Endpoint	Method	Description
`/`	GET	Health check
`/info`	GET	Model metadata (input/output features, model type)
`/predict`	POST	Single-record prediction
`/batch_predict`	POST	Batch prediction
`/metrics`	GET	Prometheus metrics (if prometheus_client is installed)

Single prediction¶

Send a JSON body with feature values:

curl http://localhost:8000/predict -X POST \
  -H "Content-Type: application/json" \
  -d '{"text_feature": "words to predict", "number_feature": 42}'
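The same request can be sent from Python using only the standard library. The sketch below is a minimal client under stated assumptions: the server is reachable at `http://localhost:8000`, and `build_predict_request` and `predict` are hypothetical helper names, not part of Ludwig.

```python
import json
import urllib.request


def build_predict_request(base_url: str, record: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str, record: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_predict_request(base_url, record)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network, so it is safe to inspect.
request = build_predict_request(
    "http://localhost:8000",
    {"text_feature": "words to predict", "number_feature": 42},
)
print(request.get_method(), request.full_url)
```

Calling `predict("http://localhost:8000", {...})` with the same record sends the request once `ludwig serve` is running.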

Response:

{
  "output_predictions": "class_a",
  "output_probabilities_class_a": 0.82,
  "output_probabilities_class_b": 0.18,
  "output_probability": 0.82
}
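A client usually wants the winning class and its probability rather than the flat dictionary. The following sketch assumes that per-class probabilities follow the `<feature>_probabilities_<class>` key pattern seen in the sample response above; `top_class` is a hypothetical helper, and other output feature types may use different keys.

```python
def top_class(response: dict, feature: str = "output") -> tuple[str, float]:
    """Return (label, probability) for the most likely class of one output feature."""
    prefix = f"{feature}_probabilities_"
    # Collect only the per-class entries; "<feature>_probability" does not match the prefix.
    probabilities = {
        key[len(prefix):]: value
        for key, value in response.items()
        if key.startswith(prefix)
    }
    if not probabilities:
        raise ValueError(f"no per-class probabilities found for feature {feature!r}")
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


# The sample response shown above.
sample_response = {
    "output_predictions": "class_a",
    "output_probabilities_class_a": 0.82,
    "output_probabilities_class_b": 0.18,
    "output_probability": 0.82,
}
print(top_class(sample_response))
```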

Batch prediction¶

Send a JSON array:

curl http://localhost:8000/batch_predict -X POST \
  -H "Content-Type: application/json" \
  -d '[{"text": "first example"}, {"text": "second example"}]'
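For larger datasets it can help to split the records into several smaller arrays instead of sending one very large body. This is a sketch only: `iter_batch_payloads` is a hypothetical helper, and the batch size is an arbitrary choice, not a limit documented by Ludwig.

```python
import json
from typing import Iterable, Iterator


def iter_batch_payloads(records: Iterable[dict], batch_size: int) -> Iterator[str]:
    """Yield JSON array strings, each holding at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield json.dumps(batch)
            batch = []
    if batch:  # final, possibly smaller, batch
        yield json.dumps(batch)


records = [{"text": f"example {i}"} for i in range(5)]
for payload in iter_batch_payloads(records, batch_size=2):
    print(payload)  # each payload is a valid body for POST /batch_predict
```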

Features¶

Auto-generated Pydantic schemas from model config for request/response validation
Prometheus metrics at /metrics with request count and latency histograms
Structured logging of every request (method, path, status, duration, client)
Configurable timeout for long predictions (returns HTTP 504 on timeout)
Model hot-swap via dependency injection

Python API¶

from ludwig.serve_v2 import create_app

app = create_app(
    model_path="path/to/model",
    allowed_origins=["*"],
    prediction_timeout=300.0,
)
# Serve with any ASGI server, e.g. uvicorn, or gunicorn with uvicorn workers

The same timeout is available as the --prediction_timeout flag on ludwig serve. Requests that exceed the timeout return HTTP 504 and are recorded by the structured-logging middleware, so downstream log collectors can aggregate them.
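To illustrate the timeout behavior, here is a minimal sketch of how a blocking prediction call can be bounded and mapped to a 504 status. It is not Ludwig's implementation: `predict_with_timeout`, `fast_model`, and `slow_model` are hypothetical names, and the sketch uses `asyncio.to_thread`, which requires Python 3.9 or later.

```python
import asyncio
import time


async def predict_with_timeout(predict_fn, record: dict, timeout: float):
    """Run a blocking predict_fn in a worker thread and stop waiting after timeout seconds."""
    try:
        result = await asyncio.wait_for(
            asyncio.to_thread(predict_fn, record), timeout=timeout
        )
    except asyncio.TimeoutError:
        # The worker thread is not interrupted; only the caller stops waiting.
        return 504, {"error": f"prediction exceeded {timeout} seconds"}
    return 200, result


def fast_model(record: dict) -> dict:
    return {"output_predictions": "class_a"}


def slow_model(record: dict) -> dict:
    time.sleep(0.2)  # stand-in for a long-running prediction
    return {"output_predictions": "class_a"}


fast_status, fast_body = asyncio.run(
    predict_with_timeout(fast_model, {"text": "hi"}, timeout=1.0)
)
slow_status, slow_body = asyncio.run(
    predict_with_timeout(slow_model, {"text": "hi"}, timeout=0.05)
)
print(fast_status, slow_status)
```

Because the underlying thread keeps running after the timeout fires, a timeout protects the client from waiting but does not free the compute used by the abandoned prediction.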

LLM Serving with vLLM¶

For LLMs, Ludwig provides an OpenAI-compatible serving backend powered by vLLM with PagedAttention and continuous batching.

Prerequisites¶

pip install vllm

Starting the server¶

from ludwig.serve_vllm import run_vllm_server

run_vllm_server(
    model_path="path/to/ludwig/model",
    host="0.0.0.0",
    port=8000,
    gpu_memory_utilization=0.9,
)

OpenAI-compatible endpoints¶

The vLLM server exposes endpoints that follow the OpenAI API:

# Chat completions
curl http://localhost:8000/v1/chat/completions -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ludwig-llm",
    "messages": [{"role": "user", "content": "What is machine learning?"}],
    "temperature": 0.7,
    "max_tokens": 256
  }'

# Text completions
curl http://localhost:8000/v1/completions -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ludwig-llm",
    "prompt": "Machine learning is",
    "max_tokens": 100
  }'

# List models
curl http://localhost:8000/v1/models

Configuration¶

Parameter	Default	Description
`model_path`	required	Path to Ludwig model or HuggingFace model ID
`gpu_memory_utilization`	0.9	Fraction of GPU memory to use
`tensor_parallel_size`	1	Number of GPUs for tensor parallelism
`quantization`	None	Quantization method: awq, gptq, fp8
`max_model_len`	auto	Maximum sequence length
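Checking these values before launching the server surfaces configuration mistakes early. The sketch below mirrors the table in a dataclass; `VLLMServerConfig` is a hypothetical helper, the allowed ranges are assumptions based on the parameter descriptions, and `None` stands in for the `auto` default of `max_model_len`.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Quantization methods listed in the table above.
SUPPORTED_QUANTIZATION = {"awq", "gptq", "fp8"}


@dataclass
class VLLMServerConfig:
    """Hypothetical container for the parameters in the table (not a Ludwig class)."""

    model_path: str
    gpu_memory_utilization: float = 0.9
    tensor_parallel_size: int = 1
    quantization: Optional[str] = None
    max_model_len: Optional[int] = None  # None stands in for "auto"

    def __post_init__(self) -> None:
        if not self.model_path:
            raise ValueError("model_path is required")
        if not 0.0 < self.gpu_memory_utilization <= 1.0:
            raise ValueError("gpu_memory_utilization must be in (0, 1]")
        if self.tensor_parallel_size < 1:
            raise ValueError("tensor_parallel_size must be at least 1")
        if self.quantization is not None and self.quantization not in SUPPORTED_QUANTIZATION:
            raise ValueError(f"quantization must be one of {sorted(SUPPORTED_QUANTIZATION)}")
        if self.max_model_len is not None and self.max_model_len < 1:
            raise ValueError("max_model_len must be positive")


config = VLLMServerConfig(model_path="path/to/ludwig/model", quantization="awq")
server_kwargs = asdict(config)  # plain dict of validated settings
print(server_kwargs)
```

Whether the launcher accepts every one of these names as a keyword argument should be confirmed against the installed Ludwig version.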

Using with OpenAI Python client¶

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="ludwig-llm",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Ray Serve¶

Ray Serve provides autoscaling multi-replica serving on a Ray cluster. Use this for production deployments where you need horizontal scaling or traffic splitting.

Install¶

pip install "ludwig[distributed]"

Deploy¶

import ray
from ludwig.serve_ray_serve import deploy_ludwig_model

ray.init(ignore_reinit_error=True)

handle = deploy_ludwig_model(
    model_path="results/experiment_run/model",
    name="sentiment",
    num_replicas=2,
    ray_actor_options={"num_gpus": 1},  # omit for CPU
)

Or via the CLI helper:

python examples/serving/ray_serve/deploy.py \
    --model_path ./results/my_model \
    --num_replicas 2 \
    --gpu \
    --block

Send predictions¶

curl -X POST http://localhost:8000/sentiment \
  -H "Content-Type: application/json" \
  -d '{"text": "I love this product!"}'

# Batch (list of records)
curl -X POST http://localhost:8000/sentiment \
  -H "Content-Type: application/json" \
  -d '[{"text": "great"}, {"text": "terrible"}]'

KServe¶

KServe is a Kubernetes-native model inference platform that implements the Open Inference Protocol (v2). Use this for cloud/Kubernetes production deployments.

Install¶

pip install kserve

Serve locally¶

python -m ludwig.serve_kserve \
    --model_name sentiment \
    --model_path results/experiment_run/model

Send v2 protocol requests¶

curl -X POST http://localhost:8080/v2/models/sentiment/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {"name": "text", "shape": [1], "datatype": "BYTES", "data": ["I love this!"]}
    ]
  }'

For full deployment examples (autoscaling, A/B testing, Kubernetes manifests) see the Ray Serve and KServe example.

Choosing a serving option¶

Option	Scale	Protocol	Best for
`ludwig serve` (FastAPI)	Single process	REST JSON	Development, simple APIs
Ray Serve	Multi-replica, autoscaling	REST JSON	Production on Ray clusters
KServe	Kubernetes-native	OIP v2	Enterprise Kubernetes
`ludwig serve` (vLLM)	GPU-optimized	OpenAI-compatible	LLM serving