---
title: "Choosing a Calibrator"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Choosing a Calibrator}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(calibratr)
library(dplyr)
```

## The main decision

Choose a calibrator based on the input scale, the amount of calibration data,
and the shape of the expected miscalibration. Binary methods take vectors, and
multiclass methods take matrices with one column per class.

| Method | Input scale | Useful when | Main caution |
|---|---|---|---|
| `cal_platt()` | score or probability | the calibration curve is close to logistic | limited flexibility |
| `cal_temperature()` | logits | the model is overconfident but its ranking is already good | requires logits, not probabilities |
| `cal_beta()` | probability | probabilities have asymmetric distortion | clips `0` and `1` before fitting |
| `cal_isotonic()` | probability | many calibration observations are available | can overfit small calibration sets |
| `cal_histogram()` | probability | interpretability by bins is preferred | depends on bin choice |

## Match the input scale

The most common mistake is passing probabilities to a method that expects
logits. `cal_temperature()` expects logits because it estimates a scalar
temperature in logit space. If only binary probabilities are available,
convert them with `logit()` first.

```{r scales}
p <- c(0.1, 0.3, 0.7, 0.9)
logits <- logit(p)
round(inv_logit(logits), 3)
```
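
To see why the scale matters, here is a minimal base-R sketch of binary
temperature scaling. It assumes the common form `plogis(z / T)` for a logit `z`
and temperature `T`; `scale_by_temperature()` is a hypothetical helper, and
`cal_temperature()` may be implemented differently.

```{r temperature-sketch}
# Assumption: binary temperature scaling maps a logit z to plogis(z / T).
scale_by_temperature <- function(z, temperature) plogis(z / temperature)

sketch_logits <- c(-3, -1, 1, 3)

# Correct use: logits in, softened probabilities out (T > 1 reduces confidence).
sketch_ok <- scale_by_temperature(sketch_logits, temperature = 2)
round(sketch_ok, 3)

# Mistake: probabilities passed as if they were logits. Inputs in [0, 1] are
# treated as small positive logits, so every output lands just above 0.5.
sketch_wrong <- scale_by_temperature(plogis(sketch_logits), temperature = 2)
round(sketch_wrong, 3)
```

The second call runs without error, which is why this mistake is easy to miss:
the output still looks like a vector of probabilities.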

## Compare methods on held-out data

The example below simulates overconfident probabilities: the raw logits are 1.6
times the true logits. Each calibrator is fitted on the calibration split, and
the raw and calibrated probabilities are then compared on the same test split.

```{r comparison}
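set.seed(2027)
n <- 500
predictions <- data.frame(x = rnorm(n)) |>
  mutate(
    true_p = inv_logit(-0.2 + x),
    y = rbinom(n(), 1, true_p),
    # Multiplying the true logit by 1.6 makes the raw probabilities too extreme.
    raw_logits = 1.6 * (-0.2 + x),
    raw_p = inv_logit(raw_logits),
    # Random, balanced assignment of rows to the two splits.
    split = sample(rep(c("calibration", "test"), length.out = n))
  )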

calibration <- predictions |>
  filter(split == "calibration")

test <- predictions |>
  filter(split == "test")

# Fit every calibrator on the calibration split only.
fits <- list(
  platt = cal_platt(calibration$raw_p, calibration$y),
  beta = cal_beta(calibration$raw_p, calibration$y),
  isotonic = cal_isotonic(calibration$raw_p, calibration$y),
  histogram = cal_histogram(calibration$raw_p, calibration$y, bins = 10),
  # Temperature scaling is the one method here that takes logits.
  temperature = cal_temperature(calibration$raw_logits, calibration$y)
)

test <- test |>
  mutate(
    platt = predict(fits$platt, raw_p),
    beta = predict(fits$beta, raw_p),
    isotonic = predict(fits$isotonic, raw_p),
    histogram = predict(fits$histogram, raw_p),
    temperature = predict(fits$temperature, raw_logits)
  )

# Score each set of probabilities on the test split with the same ECE settings.
bind_rows(
  test |> summarise(method = "raw", ece = ece(raw_p, y, bins = 10)),
  test |> summarise(method = "platt", ece = ece(platt, y, bins = 10)),
  test |> summarise(method = "beta", ece = ece(beta, y, bins = 10)),
  test |> summarise(method = "isotonic", ece = ece(isotonic, y, bins = 10)),
  test |> summarise(method = "histogram", ece = ece(histogram, y, bins = 10)),
  test |> summarise(method = "temperature", ece = ece(temperature, y, bins = 10))
) |>
  mutate(ece = round(ece, 3)) |>
  arrange(ece)
```
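
The comparison above ranks methods by expected calibration error (ECE). As a
rough guide to what that number measures, the following base-R sketch computes
a binned ECE with equal-width bins on `[0, 1]`. `ece_sketch()` is a hypothetical
helper; the binning and edge handling in calibratr's `ece()` may differ.

```{r ece-sketch}
# Hypothetical helper (not calibratr's ece()): equal-width bins on [0, 1].
ece_sketch <- function(p, y, bins = 10) {
  # Assign each prediction to a bin; include.lowest keeps p = 0 in the first bin.
  bin <- cut(p, breaks = seq(0, 1, length.out = bins + 1), include.lowest = TRUE)
  # Per-bin mean prediction, observed event rate, and share of observations.
  confidence <- as.numeric(tapply(p, bin, mean))
  event_rate <- as.numeric(tapply(y, bin, mean))
  weight <- as.numeric(table(bin)) / length(p)
  # Empty bins have NA means and zero weight; leave them out of the sum.
  keep <- weight > 0
  sum(weight[keep] * abs(confidence[keep] - event_rate[keep]))
}

# Two occupied bins: 0.5 * |0.2 - 0| + 0.5 * |0.8 - 1| = 0.2
ece_sketch(c(0.2, 0.2, 0.8, 0.8), c(0, 0, 1, 1), bins = 2)
```

Because the result depends on the number of bins and on how many test
observations fall in each one, small differences between methods on a test set
of this size should be read with caution.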

## Practical guidance

Start with `cal_beta()` or `cal_platt()` for small to moderate calibration sets;
both fit only a few parameters, so they are less prone to overfitting.
Use `cal_temperature()` when logits are available and the problem is mainly
overconfidence. Use `cal_isotonic()` when the calibration set is large enough to
support a flexible monotone curve. Use `cal_histogram()` when a bin-level rule is
easier to audit or explain.
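
The difference in flexibility behind this guidance can be illustrated with base
R alone. The sketch below is not how calibratr fits its models: it uses `glm()`
as a stand-in for a Platt-style fit and `stats::isoreg()` as a stand-in for
isotonic calibration, on simulated data whose names (`sketch_x`, `sketch_raw`,
and so on) are made up for this example.

```{r flexibility-sketch}
set.seed(1)
sketch_n <- 400
sketch_x <- rnorm(sketch_n)
sketch_y <- rbinom(sketch_n, 1, plogis(sketch_x))
sketch_raw <- plogis(2 * sketch_x)  # overconfident raw probabilities

# Parametric stand-in: logistic regression on the raw logit.
platt_sketch <- glm(sketch_y ~ qlogis(sketch_raw), family = binomial())

# Nonparametric stand-in: monotone step function from isotonic regression.
iso_sketch <- as.stepfun(isoreg(sketch_raw, sketch_y))

# The logistic fit always has two coefficients. The step function has one
# jump per knot, and the number of knots is chosen by the data.
length(coef(platt_sketch))
length(knots(iso_sketch))
```

A step function with many jumps can follow noise in a small calibration set,
which is the overfitting risk listed for `cal_isotonic()` in the table.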

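Cross-validated calibration is useful when a separate calibration set would be
too small. The out-of-fold predictions it returns can be scored directly,
because each one is expected to come from a calibrator fitted on the other
folds.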

```{r cv}
cv_fit <- cal_cv(
  predictions$raw_p,
  predictions$y,
  method = "beta",
  folds = 5,
  seed = 1
)

predictions |>
  mutate(oof = cv_fit$oof_predictions) |>
  summarise(oof_ece = ece(oof, y, bins = 10))
```
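
For readers who want to see the out-of-fold idea spelled out, here is a minimal
base-R sketch. It is not the implementation of `cal_cv()`: the helper
`oof_platt_sketch()` is hypothetical, it uses a Platt-style `glm()` fit rather
than beta calibration, and it assumes probabilities strictly between 0 and 1 so
that `qlogis()` stays finite.

```{r oof-sketch}
# Hypothetical helper (not calibratr's cal_cv()): each observation is
# calibrated by a model that was fitted without it.
oof_platt_sketch <- function(p, y, folds = 5, seed = 1) {
  set.seed(seed)
  # Random, balanced fold assignment.
  fold <- sample(rep(seq_len(folds), length.out = length(p)))
  oof <- numeric(length(p))
  for (k in seq_len(folds)) {
    held_out <- fold == k
    train <- data.frame(z = qlogis(p[!held_out]), y = y[!held_out])
    fit <- glm(y ~ z, family = binomial(), data = train)
    # Predict only the rows that this fit has not seen.
    oof[held_out] <- predict(
      fit,
      newdata = data.frame(z = qlogis(p[held_out])),
      type = "response"
    )
  }
  oof
}

set.seed(42)
sketch_cv_x <- rnorm(300)
sketch_cv_y <- rbinom(300, 1, plogis(sketch_cv_x))
sketch_cv_raw <- plogis(2 * sketch_cv_x)  # overconfident raw probabilities

sketch_cv_oof <- oof_platt_sketch(sketch_cv_raw, sketch_cv_y)
length(sketch_cv_oof)
```

Every row receives exactly one out-of-fold prediction, so the full data set can
be used for both fitting and evaluation without scoring a calibrator on rows it
was trained on.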