Choosing a Calibrator

library(calibratr)
library(dplyr)

The main decision

Choose a calibrator based on the input scale, the amount of calibration data, and the shape of the expected miscalibration. Binary methods take vectors, and multiclass methods take matrices with one column per class.

Method	Input scale	Useful when	Main caution
`cal_platt()`	score or probability	the calibration curve is close to logistic	limited flexibility
`cal_temperature()`	logits	a model is overconfident but ranking is useful	requires logits, not probabilities
`cal_beta()`	probability	probabilities have asymmetric distortion	clips `0` and `1` before fitting
`cal_isotonic()`	probability	many calibration observations are available	can overfit small calibration sets
`cal_histogram()`	probability	interpretability by bins is preferred	depends on bin choice

Match the input scale

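The most common mistake is passing probabilities to a method that expects logits. cal_temperature() expects logits because it estimates a scalar temperature in logit space: temperature scaling divides each logit by a single fitted temperature before mapping back to a probability. If only probabilities are available, convert them with logit() first; inv_logit() reverses the conversion.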

p <- c(0.1, 0.3, 0.7, 0.9)
logits <- logit(p)
round(inv_logit(logits), 3)
#> [1] 0.1 0.3 0.7 0.9
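
To make the logit-space fit concrete, the following base-R sketch estimates a single temperature by minimizing log loss on simulated overconfident logits. It illustrates the general technique and is not calibratr's implementation; `inv_logit_sketch()`, `nll_at_temperature()`, and the simulated data are assumptions introduced here.

```r
# Base-R sketch of temperature scaling (hypothetical helpers, not calibratr code).
inv_logit_sketch <- function(z) 1 / (1 + exp(-z))

# Mean negative log-likelihood of labels y when logits are divided by temp.
nll_at_temperature <- function(temp, logits, y) {
  p <- inv_logit_sketch(logits / temp)
  eps <- 1e-12                      # guard against log(0)
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Simulated data: the raw logits are twice the true logits (overconfident).
set.seed(1)
true_logits <- rnorm(2000)
y <- rbinom(2000, 1, inv_logit_sketch(true_logits))
raw_logits <- 2 * true_logits

# One-dimensional search for the temperature that minimizes log loss.
fit <- optimize(nll_at_temperature, interval = c(0.05, 20),
                logits = raw_logits, y = y)
temperature_hat <- fit$minimum

# Calibrated probabilities: rescale the logits, then map back to [0, 1].
calibrated_p <- inv_logit_sketch(raw_logits / temperature_hat)
```

Dividing by a positive temperature is monotone, so the ranking of predictions is unchanged and only their confidence is rescaled. Passing probabilities to this objective instead of logits would fit the wrong quantity, which is why the input scale matters.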

Compare methods on held-out data

The example below simulates overconfident predictions: the raw logits are 1.6 times the true logits. Each calibrator is fit on the calibration split only, and the raw and calibrated probabilities are then compared on the same held-out test split.

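set.seed(2027)
n <- 500
predictions <- data.frame(x = rnorm(n)) |>
  mutate(
    # True event probability and an outcome drawn from it.
    true_p = inv_logit(-0.2 + x),
    y = rbinom(n(), 1, true_p),
    # Raw model output: true logits scaled by 1.6, so predictions are overconfident.
    raw_logits = 1.6 * (-0.2 + x),
    raw_p = inv_logit(raw_logits),
    # Random, evenly sized calibration and test splits.
    split = sample(rep(c("calibration", "test"), length.out = n))
  )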

calibration <- predictions |>
  filter(split == "calibration")

test <- predictions |>
  filter(split == "test")

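# Fit on the calibration split only. Temperature scaling takes logits;
# the other methods take probabilities.
fits <- list(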
  platt = cal_platt(calibration$raw_p, calibration$y),
  beta = cal_beta(calibration$raw_p, calibration$y),
  isotonic = cal_isotonic(calibration$raw_p, calibration$y),
  histogram = cal_histogram(calibration$raw_p, calibration$y, bins = 10),
  temperature = cal_temperature(calibration$raw_logits, calibration$y)
)

# Apply each fitted calibrator to the held-out test split.
test <- test |>
  mutate(
    platt = predict(fits$platt, raw_p),
    beta = predict(fits$beta, raw_p),
    isotonic = predict(fits$isotonic, raw_p),
    histogram = predict(fits$histogram, raw_p),
    temperature = predict(fits$temperature, raw_logits)
  )

# ECE per method on the test split, sorted so the lowest ECE comes first.
bind_rows(
  test |> summarise(method = "raw", ece = ece(raw_p, y, bins = 10)),
  test |> summarise(method = "platt", ece = ece(platt, y, bins = 10)),
  test |> summarise(method = "beta", ece = ece(beta, y, bins = 10)),
  test |> summarise(method = "isotonic", ece = ece(isotonic, y, bins = 10)),
  test |> summarise(method = "histogram", ece = ece(histogram, y, bins = 10)),
  test |> summarise(method = "temperature", ece = ece(temperature, y, bins = 10))
) |>
  mutate(ece = round(ece, 3)) |>
  arrange(ece)
#>        method   ece
#> 1   histogram 0.069
#> 2       platt 0.087
#> 3 temperature 0.091
#> 4         raw 0.109
#> 5    isotonic 0.111
#> 6        beta 0.121
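
The `ece()` values above summarize the gap between predicted probability and observed frequency within bins. A minimal base-R sketch of one common definition follows, assuming equal-width bins on [0, 1]. calibratr's `ece()` may bin or weight differently, and `ece_sketch()` is a name introduced here for illustration only.

```r
# Binned expected calibration error with equal-width bins (illustrative sketch).
ece_sketch <- function(p, y, bins = 10) {
  stopifnot(length(p) == length(y), all(p >= 0 & p <= 1))
  breaks <- seq(0, 1, length.out = bins + 1)
  bin <- cut(p, breaks = breaks, include.lowest = TRUE)
  mean_p <- as.numeric(tapply(p, bin, mean))       # mean prediction per bin
  mean_y <- as.numeric(tapply(y, bin, mean))       # observed event rate per bin
  weight <- as.numeric(table(bin)) / length(p)     # share of observations per bin
  occupied <- weight > 0                           # empty bins contribute nothing
  sum(weight[occupied] * abs(mean_p[occupied] - mean_y[occupied]))
}

# Usage on simulated data: calibrated versus overconfident probabilities.
set.seed(42)
z <- rnorm(5000)
p_true <- 1 / (1 + exp(-z))
y <- rbinom(5000, 1, p_true)
p_over <- 1 / (1 + exp(-2 * z))                    # logits doubled

ece_calibrated <- ece_sketch(p_true, y)
ece_overconfident <- ece_sketch(p_over, y)
```

With only a few hundred test observations, each bin holds few points, so differences of a few hundredths between methods can fall within sampling noise.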

Practical guidance

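Start with cal_beta() or cal_platt() for small to moderate calibration sets. Use cal_temperature() when logits are available and the problem is mainly overconfidence. Use cal_isotonic() when the calibration set is large enough to support a flexible monotone curve. Use cal_histogram() when a bin-level rule is easier to audit or explain. Treat these as starting points and confirm the choice on held-out data: in the comparison above, with 250 calibration observations, cal_isotonic() and cal_beta() did not improve on the raw ECE.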
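
The trade-off between a low-parameter curve and a flexible monotone curve can be seen with base R alone. The sketch below fits a two-parameter logistic (Platt-type) curve with `glm()` and an isotonic curve with `isoreg()` on a deliberately small simulated calibration set. These are stand-ins chosen for illustration, not the internals of `cal_platt()` or `cal_isotonic()`, and all variable names are assumptions.

```r
# Small simulated calibration set with overconfident probabilities.
set.seed(7)
n_small <- 60
z <- rnorm(n_small)
y <- rbinom(n_small, 1, plogis(z))
raw_p <- plogis(1.6 * z)

# Platt-type fit: intercept and slope on the logit of the raw probability.
platt_fit <- glm(y ~ qlogis(raw_p), family = binomial())

# Isotonic fit: a non-decreasing step function of the raw probability.
iso_fit <- isoreg(raw_p, y)
iso_fun <- as.stepfun(iso_fit)

# Evaluate both calibration curves on a common grid.
grid <- seq(0.05, 0.95, by = 0.05)
platt_curve <- predict(platt_fit, newdata = data.frame(raw_p = grid),
                       type = "response")
iso_curve <- iso_fun(grid)

# Rough complexity measures: fixed at 2 for the logistic curve,
# data-dependent for the isotonic step function.
n_platt_parameters <- length(coef(platt_fit))
n_iso_levels <- length(unique(iso_fit$yf))
```

The logistic curve always has two parameters, while the number of isotonic levels grows with the data. On a small calibration set those steps can track noise, which is the overfitting risk noted in the table.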

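Cross-validated calibration is useful when a separate calibration set would be too small. Each out-of-fold prediction comes from a calibrator fit on the other folds, so every observation is scored by a model that did not see it. The example below uses all 500 simulated observations, so its ECE is not directly comparable to the test-split values above.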

cv_fit <- cal_cv(predictions$raw_p, predictions$y, method = "beta", folds = 5, seed = 1)

predictions |>
  mutate(oof = cv_fit$oof_predictions) |>
  summarise(oof_ece = ece(oof, y, bins = 10))
#>      oof_ece
#> 1 0.03067008
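
The out-of-fold idea can be sketched in base R as follows. This is a hedged illustration that uses a Platt-type `glm()` calibrator in place of the beta method; it is not how `cal_cv()` is implemented, and the fold assignment, variable names, and simulated data are assumptions.

```r
# Simulated overconfident predictions.
set.seed(1)
n <- 300
z <- rnorm(n)
y <- rbinom(n, 1, plogis(z))
raw_p <- plogis(1.6 * z)
dat <- data.frame(y = y, score = qlogis(raw_p))

# Assign each observation to one of k evenly sized folds at random.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold, fit the calibrator on the other folds and predict the held-out fold.
oof <- numeric(n)
for (i in seq_len(k)) {
  held_out <- fold == i
  fit <- glm(y ~ score, family = binomial(), data = dat[!held_out, ])
  oof[held_out] <- predict(fit, newdata = dat[held_out, ], type = "response")
}
```

Every element of `oof` is produced by a model that never saw that observation, so calibration metrics computed on it are not inflated by fitting and scoring on the same rows.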