Choose a calibrator based on the input scale, the amount of calibration data, and the shape of the expected miscalibration. Binary methods take vectors, and multiclass methods take matrices with one column per class.

| Method | Input scale | Useful when | Main caution |
|---|---|---|---|
| `cal_platt()` | score or probability | the calibration curve is close to logistic | limited flexibility |
| `cal_temperature()` | logits | a model is overconfident but ranking is useful | requires logits, not probabilities |
| `cal_beta()` | probability | probabilities have asymmetric distortion | clips 0 and 1 before fitting |
| `cal_isotonic()` | probability | many calibration observations are available | can overfit small calibration sets |
| `cal_histogram()` | probability | interpretability by bins is preferred | depends on bin choice |

The most common mistake is passing probabilities to a method that expects logits. `cal_temperature()` estimates a single scalar temperature in logit space, so its input must be logits, not probabilities.
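
To make "a scalar temperature in logit space" concrete, here is a minimal base R sketch of temperature scaling for a binary outcome. It assumes the standard formulation, in which logits are divided by a temperature chosen to minimize log loss on the calibration data. The helpers `binary_log_loss()`, `fit_temperature()`, and `apply_temperature()` are hypothetical names used for illustration and are not the package's internals, which may differ.

```r
# Sketch only: standard temperature scaling for a binary outcome, base R.
# All helper names here are hypothetical, not part of the package.

# Mean negative log-likelihood of y when logits are divided by `temp`.
binary_log_loss <- function(temp, logits, y) {
  p <- plogis(logits / temp)
  eps <- 1e-12
  p <- pmin(pmax(p, eps), 1 - eps)  # guard against log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# One-dimensional search for the temperature that minimizes log loss.
fit_temperature <- function(logits, y, interval = c(0.05, 20)) {
  optimize(binary_log_loss, interval = interval, logits = logits, y = y)$minimum
}

# Apply a fitted temperature and map back to probabilities.
apply_temperature <- function(logits, temp) {
  plogis(logits / temp)
}

set.seed(1)
true_logits <- rnorm(2000)
y <- rbinom(2000, 1, plogis(true_logits))
raw_logits <- 2 * true_logits           # overconfident by a factor of 2

temp <- fit_temperature(raw_logits, y)  # expected to land near 2
calibrated_p <- apply_temperature(raw_logits, temp)

# If only probabilities are available, recover logits with qlogis() first.
raw_p <- plogis(raw_logits)
recovered_logits <- qlogis(raw_p)
```

Because dividing by a positive temperature is monotone, the ranking of observations is unchanged; only the confidence of the probabilities moves. That is why the method suits a model that is overconfident but ranks well.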

The example below simulates overconfident probabilities by multiplying the true logits by 1.6. Each calibrator is fit on the calibration half of the data, and the raw and calibrated probabilities are then compared on the same held-out test half.

    # dplyr supplies mutate(), filter(), summarise(), bind_rows(), arrange(), n().
    # inv_logit(), ece(), and the cal_*() functions are assumed to be available
    # already (for example, from the package this section documents).
    library(dplyr)

    set.seed(2027)
    n <- 500

    # Simulate outcomes from true_p, then inflate the logits to get
    # overconfident raw predictions.
    predictions <- data.frame(x = rnorm(n)) |>
      mutate(
        true_p = inv_logit(-0.2 + x),
        y = rbinom(n(), 1, true_p),
        raw_logits = 1.6 * (-0.2 + x),
        raw_p = inv_logit(raw_logits),
        split = sample(rep(c("calibration", "test"), length.out = n))
      )

    calibration <- predictions |>
      filter(split == "calibration")
    test <- predictions |>
      filter(split == "test")

    # Fit every calibrator on the calibration split only.
    # cal_temperature() takes logits; the others take probabilities.
    fits <- list(
      platt = cal_platt(calibration$raw_p, calibration$y),
      beta = cal_beta(calibration$raw_p, calibration$y),
      isotonic = cal_isotonic(calibration$raw_p, calibration$y),
      histogram = cal_histogram(calibration$raw_p, calibration$y, bins = 10),
      temperature = cal_temperature(calibration$raw_logits, calibration$y)
    )

    # Apply each fitted calibrator to the held-out test split.
    test <- test |>
      mutate(
        platt = predict(fits$platt, raw_p),
        beta = predict(fits$beta, raw_p),
        isotonic = predict(fits$isotonic, raw_p),
        histogram = predict(fits$histogram, raw_p),
        temperature = predict(fits$temperature, raw_logits)
      )

    # Compare expected calibration error (10 bins) on the test split.
    bind_rows(
      test |> summarise(method = "raw", ece = ece(raw_p, y, bins = 10)),
      test |> summarise(method = "platt", ece = ece(platt, y, bins = 10)),
      test |> summarise(method = "beta", ece = ece(beta, y, bins = 10)),
      test |> summarise(method = "isotonic", ece = ece(isotonic, y, bins = 10)),
      test |> summarise(method = "histogram", ece = ece(histogram, y, bins = 10)),
      test |> summarise(method = "temperature", ece = ece(temperature, y, bins = 10))
    ) |>
      mutate(ece = round(ece, 3)) |>
      arrange(ece)
    #>        method   ece
    #> 1   histogram 0.069
    #> 2       platt 0.087
    #> 3 temperature 0.091
    #> 4         raw 0.109
    #> 5    isotonic 0.111
    #> 6        beta 0.121

The test set here has only 250 observations, so ECE differences of this size can reflect sampling noise; read the ranking as illustrative, not as a general ordering of the methods.

Start with `cal_beta()` or `cal_platt()` for small to moderate calibration sets. Use `cal_temperature()` when logits are available and the problem is mainly overconfidence. Use `cal_isotonic()` when the calibration set is large enough to support a flexible monotone curve. Use `cal_histogram()` when a bin-level rule is easier to audit or explain.
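
The comparison above ranks methods by `ece(..., bins = 10)`. As a rough guide to what that number measures, the sketch below computes expected calibration error with equal-width bins in base R: within each bin it takes the absolute gap between the mean predicted probability and the observed event rate, then averages the gaps weighted by bin size. `ece_sketch()` is a hypothetical helper; the package's `ece()` may bin or weight differently.

```r
# Sketch only: expected calibration error with equal-width bins, base R.
# ece_sketch() is a hypothetical helper, not the package's ece().
ece_sketch <- function(p, y, bins = 10) {
  breaks <- seq(0, 1, length.out = bins + 1)
  bin <- cut(p, breaks = breaks, include.lowest = TRUE)
  mean_p <- tapply(p, bin, mean)                 # mean prediction per bin
  mean_y <- tapply(y, bin, mean)                 # observed event rate per bin
  weight <- tapply(p, bin, length) / length(p)   # share of observations per bin
  occupied <- !is.na(weight)                     # empty bins come back as NA
  sum(weight[occupied] * abs(mean_p[occupied] - mean_y[occupied]))
}

set.seed(7)
p_true <- runif(1000)
y_sim <- rbinom(1000, 1, p_true)
p_overconfident <- plogis(2 * qlogis(p_true))    # push probabilities toward 0 and 1

ece_calibrated <- ece_sketch(p_true, y_sim)
ece_overconfident <- ece_sketch(p_overconfident, y_sim)
```

Because the value depends on where the bin edges fall and how many observations land in each bin, small ECE differences on a small test set are best treated as approximate.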
Cross-validated calibration is useful when a separate calibration set would be too small.
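
One way to picture cross-validated calibration is the base R sketch below, which uses a logistic regression on the raw score as a stand-in for Platt scaling. Each observation receives a calibrated probability from a calibrator fit on the other folds, so no data has to be set aside purely for calibration. `cv_platt_sketch()` is a hypothetical helper and not the package's interface. It also assumes the scores are already out-of-sample predictions; in a full workflow the underlying model would usually be refit within each fold as well.

```r
# Sketch only: K-fold cross-validated Platt-style calibration, base R.
# cv_platt_sketch() is a hypothetical helper, not part of the package.
cv_platt_sketch <- function(score, y, folds = 5) {
  n <- length(score)
  fold_id <- sample(rep(seq_len(folds), length.out = n))
  calibrated <- numeric(n)
  for (k in seq_len(folds)) {
    held_out <- fold_id == k
    # Fit the calibrator on every fold except the held-out one.
    train <- data.frame(score = score[!held_out], y = y[!held_out])
    fit <- glm(y ~ score, family = binomial(), data = train)
    # Calibrate only the held-out observations.
    calibrated[held_out] <- predict(
      fit,
      newdata = data.frame(score = score[held_out]),
      type = "response"
    )
  }
  calibrated
}

set.seed(42)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(-0.2 + x))
raw_logits <- 1.6 * (-0.2 + x)   # overconfident scores, as in the example above

oof_p <- cv_platt_sketch(raw_logits, y, folds = 5)
```

The out-of-fold probabilities in `oof_p` can then be evaluated with a calibration metric in the same way as predictions from a separate test split.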