Choosing a Calibrator

library(calibratr)
library(dplyr)

The main decision

Choose a calibrator based on the input scale, the amount of calibration data, and the shape of the expected miscalibration. Binary methods take vectors, and multiclass methods take matrices with one column per class.

| Method | Input scale | Useful when | Main caution |
|---|---|---|---|
| cal_platt() | score or probability | the calibration curve is close to logistic | limited flexibility |
| cal_temperature() | logits | a model is overconfident but ranking is useful | requires logits, not probabilities |
| cal_beta() | probability | probabilities have asymmetric distortion | clips 0 and 1 before fitting |
| cal_isotonic() | probability | many calibration observations are available | can overfit small calibration sets |
| cal_histogram() | probability | interpretability by bins is preferred | depends on bin choice |
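
The trade-off between "limited flexibility" and "can overfit" is easier to see with two fitted curves side by side. The sketch below uses only base R (glm() and isoreg() from stats) on simulated data; it illustrates the two general techniques and is not meant to describe how cal_platt() or cal_isotonic() are implemented. The names platt_curve and iso_curve are made up for this illustration.

```r
# Sketch only, base R: a two-parameter logistic fit versus a monotone step function.
set.seed(42)
score <- runif(200)             # simulated raw probabilities
y <- rbinom(200, 1, score^2)    # true event rate is score^2, so raw scores are too high

# Platt-style calibration: logistic regression of the outcome on the score.
platt_fit <- glm(y ~ score, family = binomial())
platt_curve <- function(s) {
  as.numeric(predict(platt_fit, newdata = data.frame(score = s), type = "response"))
}

# Isotonic calibration: monotone least-squares fit, converted to a step function.
iso_fit <- isoreg(score, y)
iso_curve <- as.stepfun(iso_fit)

# Evaluate both calibration curves on a common grid of raw scores.
grid <- seq(0.05, 0.95, by = 0.1)
data.frame(score = grid, platt = platt_curve(grid), isotonic = iso_curve(grid))
```

The logistic curve is smooth and described by two coefficients, while the isotonic curve is piecewise constant and can take as many distinct levels as the data support, which is why it tends to need more calibration observations.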

Match the input scale

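The most common mistake is passing probabilities to a method that expects logits. cal_temperature() expects logits because it estimates a scalar temperature in logit space. If only probabilities are available, convert them with logit() first; inv_logit() reverses the transformation.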

p <- c(0.1, 0.3, 0.7, 0.9)
logits <- logit(p)
round(inv_logit(logits), 3)
#> [1] 0.1 0.3 0.7 0.9
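
To make "a scalar temperature in logit space" concrete, here is a minimal base R sketch of the general technique: divide the logits by a single positive number and pick the value that minimizes log loss on the calibration data. fit_temperature() is a hypothetical helper written for this illustration, and the search interval is an arbitrary assumption; cal_temperature() may be implemented differently.

```r
# Sketch only: fit a scalar temperature by minimizing binary log loss.
# fit_temperature() is a hypothetical helper, not part of calibratr.
fit_temperature <- function(logits, y, interval = c(0.05, 20)) {
  log_loss <- function(temp) {
    z <- logits / temp
    # plogis(..., log.p = TRUE) gives log(p) and log(1 - p) without overflow.
    -mean(y * plogis(z, log.p = TRUE) + (1 - y) * plogis(-z, log.p = TRUE))
  }
  optimize(log_loss, interval = interval)$minimum
}

set.seed(1)
x <- rnorm(2000)
y <- rbinom(2000, 1, plogis(x))
overconfident_logits <- 2 * x    # twice as extreme as the true logits

temp <- fit_temperature(overconfident_logits, y)   # should land near 2
calibrated_p <- plogis(overconfident_logits / temp)
```

Because every logit is divided by the same positive number, the ordering of predictions is unchanged; only their confidence moves. Passing probabilities instead of logits to a routine like this would rescale the wrong quantity, which is the mistake described above.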

Compare methods on held-out data

The example below simulates an overconfident model: the raw logits are 1.6 times the true logits. Each calibrator is fit on the calibration split, and the raw and calibrated probabilities are then compared on the same held-out test split.

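set.seed(2027)
n <- 500
predictions <- data.frame(x = rnorm(n)) |>
  mutate(
    true_p = inv_logit(-0.2 + x),      # true event probability
    y = rbinom(n(), 1, true_p),        # observed binary outcome
    raw_logits = 1.6 * (-0.2 + x),     # true logits scaled by 1.6: overconfident
    raw_p = inv_logit(raw_logits),
    # shuffle an even 250/250 assignment to the two splits
    split = sample(rep(c("calibration", "test"), length.out = n))
  )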

calibration <- predictions |>
  filter(split == "calibration")

test <- predictions |>
  filter(split == "test")

fits <- list(
  platt = cal_platt(calibration$raw_p, calibration$y),
  beta = cal_beta(calibration$raw_p, calibration$y),
  isotonic = cal_isotonic(calibration$raw_p, calibration$y),
  histogram = cal_histogram(calibration$raw_p, calibration$y, bins = 10),
  temperature = cal_temperature(calibration$raw_logits, calibration$y)
)

test <- test |>
  mutate(
    platt = predict(fits$platt, raw_p),
    beta = predict(fits$beta, raw_p),
    isotonic = predict(fits$isotonic, raw_p),
    histogram = predict(fits$histogram, raw_p),
    temperature = predict(fits$temperature, raw_logits)
  )

bind_rows(
  test |> summarise(method = "raw", ece = ece(raw_p, y, bins = 10)),
  test |> summarise(method = "platt", ece = ece(platt, y, bins = 10)),
  test |> summarise(method = "beta", ece = ece(beta, y, bins = 10)),
  test |> summarise(method = "isotonic", ece = ece(isotonic, y, bins = 10)),
  test |> summarise(method = "histogram", ece = ece(histogram, y, bins = 10)),
  test |> summarise(method = "temperature", ece = ece(temperature, y, bins = 10))
) |>
  mutate(ece = round(ece, 3)) |>
  arrange(ece)
#>        method   ece
#> 1   histogram 0.069
#> 2       platt 0.087
#> 3 temperature 0.091
#> 4         raw 0.109
#> 5    isotonic 0.111
#> 6        beta 0.121
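
For readers unfamiliar with the metric, the following base R sketch shows one common definition of expected calibration error with equal-width bins: the weighted average, over bins, of the gap between mean predicted probability and observed event rate. ece_sketch() is a hypothetical helper for illustration; the package's ece() may use different binning or edge handling.

```r
# Sketch only: equal-width binned ECE in base R.
# ece_sketch() is a hypothetical helper; calibratr's ece() may differ.
ece_sketch <- function(p, y, bins = 10) {
  breaks <- seq(0, 1, length.out = bins + 1)
  bin <- cut(p, breaks = breaks, include.lowest = TRUE)
  confidence <- tapply(p, bin, mean)             # mean predicted probability per bin
  frequency <- tapply(y, bin, mean)              # observed event rate per bin
  weight <- as.vector(table(bin)) / length(p)    # share of observations per bin
  # Empty bins have NA means and zero weight, so they are dropped from the sum.
  sum(weight * abs(confidence - frequency), na.rm = TRUE)
}

# Two bins, each off by 0.1, each holding half of the observations.
ece_sketch(c(0.1, 0.1, 0.9, 0.9), c(0, 0, 1, 1))
#> [1] 0.1
```

With only a few observations per bin, each bin's event rate is itself noisy, so small differences in ECE between methods should be read with caution.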

Practical guidance

Start with cal_beta() or cal_platt() for small to moderate calibration sets. Use cal_temperature() when logits are available and the problem is mainly overconfidence. Use cal_isotonic() when the calibration set is large enough to support a flexible monotone curve. Use cal_histogram() when a bin-level rule is easier to audit or explain. The ECE ranking above comes from a single split with 250 test rows, so treat it as an illustration rather than a benchmark.

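Cross-validated calibration is useful when a separate calibration set would be too small. The example below uses all 500 rows and scores the out-of-fold predictions, so its ECE is not directly comparable to the test-split values above.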

cv_fit <- cal_cv(predictions$raw_p, predictions$y, method = "beta", folds = 5, seed = 1)

predictions |>
  mutate(oof = cv_fit$oof_predictions) |>
  summarise(oof_ece = ece(oof, y, bins = 10))
#>      oof_ece
#> 1 0.03067008
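
The idea behind out-of-fold calibration can be sketched in base R as follows: split the rows into folds, fit a calibrator on all folds but one, and predict the held-out fold, so that no row is calibrated by a model that saw its own outcome. This sketch swaps in a Platt-style logistic fit via glm() instead of beta calibration, and oof_platt() is a hypothetical helper; it does not claim to reproduce what cal_cv() does internally.

```r
# Sketch only: out-of-fold calibrated probabilities with a Platt-style fit.
# oof_platt() is a hypothetical helper, not part of calibratr.
oof_platt <- function(score, y, folds = 5, seed = 1) {
  set.seed(seed)   # note: this resets the global RNG state
  fold <- sample(rep(seq_len(folds), length.out = length(y)))
  oof <- numeric(length(y))
  for (k in seq_len(folds)) {
    held_out <- fold == k
    train <- data.frame(score = score[!held_out], y = y[!held_out])
    fit <- glm(y ~ score, family = binomial(), data = train)
    # Predict only the rows that were excluded from this fit.
    oof[held_out] <- predict(
      fit,
      newdata = data.frame(score = score[held_out]),
      type = "response"
    )
  }
  oof
}

set.seed(7)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(x))
score <- 1.6 * x                 # overconfident logits used as the raw score
oof <- oof_platt(score, y)
```

Every row ends up with exactly one calibrated probability, which can then be scored with a metric such as ECE without setting aside a separate test split.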