# Serving Ludwig Models
Ludwig provides two serving options: a general-purpose REST API for ECD models and an OpenAI-compatible server for LLMs.
## REST API Server

### Starting the server

```bash
ludwig serve --model_path=/path/to/model
```
This spawns a FastAPI server with the following endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check |
| `/info` | GET | Model metadata (input/output features, model type) |
| `/predict` | POST | Single-record prediction |
| `/batch_predict` | POST | Batch prediction |
| `/metrics` | GET | Prometheus metrics (if `prometheus_client` is installed) |
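For a quick smoke test, the health-check and metadata endpoints can be queried from Python. This is a minimal sketch using the third-party `requests` package; adjust the base URL to match your `ludwig serve` invocation:

```python
import requests

BASE_URL = "http://0.0.0.0:8000"  # default host/port used in the examples below

# Health check: returns 200 once the model is loaded
print(requests.get(f"{BASE_URL}/").status_code)

# Model metadata: input/output features and model type
info = requests.get(f"{BASE_URL}/info").json()
print(info)
```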
### Single prediction
Send a JSON body with feature values:
```bash
curl http://0.0.0.0:8000/predict -X POST \
  -H "Content-Type: application/json" \
  -d '{"text_feature": "words to predict", "number_feature": 42}'
```
Response:
```json
{
    "output_predictions": "class_a",
    "output_probabilities_class_a": 0.82,
    "output_probabilities_class_b": 0.18,
    "output_probability": 0.82
}
```
### Batch prediction
Send a JSON array:
```bash
curl http://0.0.0.0:8000/batch_predict -X POST \
  -H "Content-Type: application/json" \
  -d '[{"text": "first example"}, {"text": "second example"}]'
```
### Features
- Auto-generated Pydantic schemas from model config for request/response validation
- Prometheus metrics at `/metrics` with request count and latency histograms
- Structured logging of every request (method, path, status, duration, client)
- Configurable timeout for long predictions (returns HTTP 504 on timeout)
- Model hot-swap via dependency injection
### Python API
```python
from ludwig.serve_v2 import create_app

app = create_app(
    model_path="path/to/model",
    allowed_origins=["*"],
    prediction_timeout=300.0,
)
# Use with uvicorn, gunicorn, or any ASGI server
```
The same timeout is exposed as the `--prediction_timeout` flag on `ludwig serve`. Requests
that exceed the timeout return HTTP 504 and are emitted through the structured-logging
middleware so they can be aggregated by downstream log collectors.
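Since `create_app` returns a standard ASGI application, it can be hosted by any ASGI server. A minimal sketch using uvicorn; the host and port values here are illustrative, not the serve command's defaults:

```python
import uvicorn

from ludwig.serve_v2 import create_app

app = create_app(
    model_path="path/to/model",
    allowed_origins=["*"],
    prediction_timeout=300.0,
)

if __name__ == "__main__":
    # Roughly equivalent to running `ludwig serve` with the corresponding flags
    uvicorn.run(app, host="0.0.0.0", port=8000)
```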
## LLM Serving with vLLM
For LLM models, Ludwig provides an OpenAI-compatible serving backend powered by vLLM with PagedAttention and continuous batching.
### Prerequisites

```bash
pip install vllm
```
### Starting the server
```python
from ludwig.serve_vllm import run_vllm_server

run_vllm_server(
    model_path="path/to/ludwig/model",
    host="0.0.0.0",
    port=8000,
    gpu_memory_utilization=0.9,
)
```
### OpenAI-compatible endpoints

The vLLM server exposes the same endpoints as the OpenAI API:
```bash
# Chat completions
curl http://localhost:8000/v1/chat/completions -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ludwig-llm",
        "messages": [{"role": "user", "content": "What is machine learning?"}],
        "temperature": 0.7,
        "max_tokens": 256
      }'

# Text completions
curl http://localhost:8000/v1/completions -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ludwig-llm",
        "prompt": "Machine learning is",
        "max_tokens": 100
      }'

# List models
curl http://localhost:8000/v1/models
```
### Configuration
| Parameter | Default | Description |
|---|---|---|
| `model_path` | required | Path to Ludwig model or HuggingFace model ID |
| `gpu_memory_utilization` | 0.9 | Fraction of GPU memory to use |
| `tensor_parallel_size` | 1 | Number of GPUs for tensor parallelism |
| `quantization` | None | Quantization method: `awq`, `gptq`, or `fp8` |
| `max_model_len` | auto | Maximum sequence length |
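Putting the table together, here is a sketch of a multi-GPU, quantized launch, assuming these options are passed as keyword arguments to `run_vllm_server` as in the snippet above. The specific values are illustrative; whether a quantization method applies depends on how the checkpoint was produced:

```python
from ludwig.serve_vllm import run_vllm_server

# Illustrative values: 2-way tensor parallelism with an AWQ-quantized checkpoint
run_vllm_server(
    model_path="path/to/ludwig/model",
    host="0.0.0.0",
    port=8000,
    gpu_memory_utilization=0.85,
    tensor_parallel_size=2,
    quantization="awq",
    max_model_len=4096,
)
```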
### Using with OpenAI Python client
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="ludwig-llm",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
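Standard OpenAI client features can be used against this server as well. For example, token streaming; a sketch assuming the vLLM backend streams chat completions the way the OpenAI API does:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Stream tokens as they are generated instead of waiting for the full response
stream = client.chat.completions.create(
    model="ludwig-llm",
    messages=[{"role": "user", "content": "Explain continuous batching in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```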