Probability Calibration

A well-calibrated model produces confidence scores that reflect empirical accuracy: when it says it is 80% confident, it should be correct approximately 80% of the time. Many neural networks are overconfident by default.

Ludwig supports temperature scaling (Guo et al., ICML 2017) directly in the decoder config — no post-processing step needed.

What is temperature scaling?

Temperature scaling divides the logits by a learnable scalar T before applying softmax:

p = softmax(logits / T)
  • T > 1 softens the distribution (reduces overconfidence).
  • T < 1 sharpens the distribution (rarely needed).
  • T = 1 is the uncalibrated model.

Temperature scaling never changes the argmax prediction — it only re-shapes the probability distribution. The scalar T is fit on the validation set after normal training completes.
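
For intuition, here is a minimal NumPy sketch, independent of Ludwig, using made-up logits for a three-class problem; it shows that a larger T flattens the distribution while the argmax never moves:

import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.2, 1.1, 0.3])  # hypothetical uncalibrated logits

for T in (0.5, 1.0, 2.0):
    p = softmax(logits / T)
    # argmax stays at class 0 for every T; only the spread of the probabilities changes
    print(f"T={T}: probs={p.round(3)}, argmax={p.argmax()}")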

Configuration

Add calibration: temperature_scaling to the decoder section of any category output feature:

model_type: ecd

input_features:
  - name: text
    type: text
    encoder:
      type: bert
      use_pretrained: true
      trainable: false

output_features:
  - name: label
    type: category
    decoder:
      type: classifier
      calibration: temperature_scaling

trainer:
  epochs: 10
  learning_rate: 2.0e-5
  optimizer:
    type: adamw

Ludwig automatically runs temperature scaling on the validation split at the end of training and serializes T with the model. The learned temperature is applied transparently at inference.
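
Ludwig performs this fit internally, but the procedure from Guo et al. is straightforward to sketch: pick the T that minimizes the negative log-likelihood of the validation labels. The snippet below is an illustrative NumPy version, not Ludwig's implementation; val_logits (an (N, num_classes) array of uncalibrated logits) and val_labels (an (N,) array of integer class indices) are assumed to be available:

import numpy as np

def nll(logits, labels, T):
    # Mean negative log-likelihood of the true labels under softmax(logits / T).
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# NLL is smooth in the single scalar T, so a coarse grid search is sufficient.
candidates = np.linspace(0.5, 5.0, 91)
T = min(candidates, key=lambda t: nll(val_logits, val_labels, t))
print(f"Fitted temperature: {T:.2f}")

Guo et al. optimize T with a few iterations of L-BFGS instead of a grid search, but either approach converges to the same scalar because the objective is one-dimensional.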

Full pipeline example

import logging

import pandas as pd
from ludwig.api import LudwigModel
from sklearn.model_selection import train_test_split

df = pd.read_csv("agnews.csv")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

model = LudwigModel(config="calibrated_config.yaml", logging_level=logging.INFO)
model.train(dataset=train_df, output_directory="results/")

# Predict: the returned probabilities are calibrated
predictions, _ = model.predict(dataset=test_df)
# label_probabilities holds the full probability vector; per-class columns are named label_probabilities_<class name>
print(predictions[["label_predictions", "label_probabilities"]])

Evaluating calibration

Calibration is measured with the Expected Calibration Error (ECE): the weighted average of |accuracy − confidence| across probability buckets. Lower is better; a perfectly calibrated model has ECE = 0.
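
Written out, with the N test predictions partitioned into M confidence bins B_1 … B_M:

ECE = Σ_m (|B_m| / N) · |acc(B_m) − conf(B_m)|

where acc(B_m) is the accuracy and conf(B_m) the mean confidence of the samples falling in bin m. The helper below implements exactly this: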

import numpy as np

def ece(confidences, labels, n_bins=15):
    """Expected Calibration Error: weighted average of |accuracy - confidence| over bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        acc = labels[mask].mean()        # empirical accuracy within the bin
        conf = confidences[mask].mean()  # mean predicted confidence within the bin
        total += mask.mean() * abs(acc - conf)
    return total

# Confidence = max probability of each prediction; correctness against the true test labels
confs = predictions["label_probabilities"].apply(max).values
labels = (predictions["label_predictions"].values == test_df["label"].values).astype(float)
print(f"ECE: {ece(confs, labels):.4f}")

Monte Carlo dropout uncertainty

For an alternative uncertainty estimate that does not require a validation set, use MC dropout:

output_features:
  - name: label
    type: category
    decoder:
      type: mlp_classifier
      num_fc_layers: 2
      dropout: 0.2
      mc_dropout_samples: 20

With mc_dropout_samples > 0, Ludwig runs the decoder that many times (20 in this example) at inference with dropout active and returns the mean prediction together with an uncertainty tensor capturing the variance across samples.
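
Ludwig wires this up automatically, but the underlying mechanism is easy to sketch in plain PyTorch. The snippet below is a generic illustration, not Ludwig's implementation; net (a torch.nn.Module classifier that outputs logits) and x (an input batch) are assumed:

import torch

def mc_dropout_predict(net, x, n_samples=20):
    # Keep the network in eval mode but re-activate its dropout layers.
    net.eval()
    for module in net.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()
    with torch.no_grad():
        # n_samples stochastic forward passes, each with a different dropout mask.
        probs = torch.stack([torch.softmax(net(x), dim=-1) for _ in range(n_samples)])
    # Mean prediction plus per-class variance as the uncertainty estimate.
    return probs.mean(dim=0), probs.var(dim=0)

mean_probs, uncertainty = mc_dropout_predict(net, x, n_samples=20)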

When to calibrate

  • Safety-critical applications where predicted confidence is acted upon (medical, finance, autonomous systems).
  • Threshold-based classifiers where a specific confidence level triggers an action.
  • Ensembles and distillation — calibrated soft targets improve student model quality.
  • Retrieval and ranking — calibrated scores enable meaningful score comparison across classes.

Temperature scaling is lightweight and almost always improves calibration without hurting accuracy, so it is recommended as a default for production classification models.