---
title: "Calibrating Binary Probabilities"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Calibrating Binary Probabilities}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(calibratr)
library(dplyr)
```

## Why calibration matters

A classifier can rank observations accurately while producing probabilities that are not calibrated. A probability of `0.8` is calibrated only if events with that prediction occur about 80 percent of the time.

Calibration matters when a decision uses the numerical probability, for example in risk thresholds or cost-sensitive decisions. It matters less when only the ranking is used.

## A three-split workflow

Calibration should be fitted on data not used to train the classifier. A common workflow uses three parts: a model training set, a calibration set, and a test set. This vignette starts from already computed probabilities, so only the calibration and test splits are shown.

```{r data}
set.seed(2026)
n <- 800

predictions <- data.frame(x = rnorm(n)) |>
  mutate(
    # True event probability and simulated binary outcome
    true_p = inv_logit(-0.5 + 1.2 * x),
    y = rbinom(n(), 1, true_p),
    # Raw scores are the true logits scaled by 1.7, so raw_p is overconfident
    raw_logits = 1.7 * (-0.5 + 1.2 * x),
    raw_p = inv_logit(raw_logits),
    # Random assignment of half the rows to each split
    split = sample(rep(c("calibration", "test"), each = n / 2))
  )

calibration <- predictions |> filter(split == "calibration")
test <- predictions |> filter(split == "test")
```

## Fit a calibrator

Beta calibration works directly on probabilities. It is a useful default when the raw model probabilities show sigmoid-shaped miscalibration.

```{r beta}
# Fit on the calibration split, then evaluate on the held-out test split
beta_fit <- cal_beta(calibration$raw_p, calibration$y)

test <- test |>
  mutate(beta = predict(beta_fit, raw_p))

test |>
  summarise(
    raw_ece = ece(raw_p, y, bins = 10),
    beta_ece = ece(beta, y, bins = 10)
  )
```

## Compare methods

The package exposes the main binary calibration methods through the same fit-predict pattern.

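The comparisons in this vignette rank calibrators by `ece()` with 10 bins. As a rough illustration of what a binned expected calibration error measures, the base R sketch below weights each bin's gap between observed event frequency and mean predicted probability by the share of observations in that bin. The helper name `binned_ece_sketch()` and the equal-width binning are assumptions made for this example; the package's `ece()` may bin or handle edge cases differently.

```{r ece-sketch}
# Hypothetical helper for illustration only; not part of the package API.
binned_ece_sketch <- function(p, y, bins = 10) {
  stopifnot(length(p) == length(y), all(p >= 0 & p <= 1))

  # Equal-width bins on [0, 1] (an assumption of this sketch)
  breaks <- seq(0, 1, length.out = bins + 1)
  bin <- cut(p, breaks = breaks, include.lowest = TRUE)

  conf <- tapply(p, bin, mean)                   # mean predicted probability
  freq <- tapply(y, bin, mean)                   # observed event frequency
  weight <- tapply(p, bin, length) / length(p)   # share of observations

  # Empty bins come back as NA and contribute nothing to the sum
  keep <- !is.na(weight)
  sum(weight[keep] * abs(freq[keep] - conf[keep]))
}

# Two occupied bins, each with a gap of 0.1 between frequency and confidence
binned_ece_sketch(c(0.1, 0.1, 0.9, 0.9), c(0, 0, 1, 1))
```

A value near zero means predicted probabilities and observed frequencies agree within each occupied bin. The result depends on the number of bins, which is why the chunks in this vignette pass `bins = 10` consistently.
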
```{r methods}
# Fit each calibrator on the calibration split
platt_fit <- cal_platt(calibration$raw_p, calibration$y)
iso_fit <- cal_isotonic(calibration$raw_p, calibration$y)
hist_fit <- cal_histogram(calibration$raw_p, calibration$y, bins = 10)
temp_fit <- cal_temperature(calibration$raw_logits, calibration$y)

# Temperature scaling takes logits; the other methods take probabilities
test <- test |>
  mutate(
    platt = predict(platt_fit, raw_p),
    isotonic = predict(iso_fit, raw_p),
    histogram = predict(hist_fit, raw_p),
    temperature = predict(temp_fit, raw_logits)
  )

# One row per method, sorted from lowest to highest test-set ECE
bind_rows(
  test |> summarise(method = "raw", ece = ece(raw_p, y, bins = 10)),
  test |> summarise(method = "platt", ece = ece(platt, y, bins = 10)),
  test |> summarise(method = "beta", ece = ece(beta, y, bins = 10)),
  test |> summarise(method = "isotonic", ece = ece(isotonic, y, bins = 10)),
  test |> summarise(method = "histogram", ece = ece(histogram, y, bins = 10)),
  test |> summarise(method = "temperature", ece = ece(temperature, y, bins = 10))
) |>
  arrange(ece)
```

## Reliability diagram

The reliability diagram shows calibration by bin. Points close to the diagonal have similar mean predicted probability and observed event frequency.

```{r diagram, fig.width = 6, fig.height = 5, fig.alt = "Reliability diagram with points near the diagonal, comparing predicted probability and observed event frequency by bin."}
reliability_diagram(test$beta, test$y, bins = 10)
```

## Cross-validated calibration

When the calibration set is small, `cal_cv()` produces out-of-fold calibrated probabilities while also fitting a final calibrator on all observations.

```{r cv}
cv_fit <- cal_cv(
  predictions$raw_p,
  predictions$y,
  method = "histogram",
  folds = 5,
  bins = 10,
  seed = 1
)

# Score the out-of-fold predictions
predictions |>
  mutate(oof = cv_fit$oof_predictions) |>
  summarise(oof_ece = ece(oof, y, bins = 10))
```

## Optional reference validation

The package includes optional tests that compare selected results against external reference implementations. These tests are not run for ordinary users unless the optional dependencies are installed.

| Reference | What is compared | Package dependency |
|---|---|---|
| Python `netcal` | `ece()`, `mce()`, `ace()` | `reticulate` and Python `netcal` |
| Python `netcal` | `cal_histogram()` with equal-width bins | `reticulate` and Python `netcal` |
| R `betacal` | `cal_beta()` predictions | `betacal` |

This keeps the runtime package in R while still allowing numerical checks against the reference implementations during development.

## Current scope

The current scope covers binary and multiclass probability calibration for predictions that were already produced by another model. Neural calibration, Bayesian binning, and direct integration with model-training frameworks are not part of the package interface.