---
title: "Applied Calibration Workflow"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Applied Calibration Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(calibratr)
library(dplyr)
```

## Goal

This vignette shows a complete calibration workflow with a dataset included in R. The example uses `iris` as a binary classification problem: `versicolor` versus `virginica`.

The important point is the data split. The classifier is fitted on a training set. The calibrator is fitted on a calibration set. The final assessment uses a test set that was not used in either fitting step.

## Prepare the data

```{r data}
set.seed(1001)

iris_binary <- iris |>
  # Keep the two overlapping species and code virginica as the event (y = 1)
  filter(Species != "setosa") |>
  mutate(y = as.integer(Species == "virginica")) |>
  # Stratified split: within each class, 25 train, 12 calibration, 13 test rows
  group_by(y) |>
  mutate(
    split = sample(rep(
      c("train", "calibration", "test"),
      times = c(25, 12, 13)
    ))
  ) |>
  ungroup()

# Check the split sizes per class
iris_binary |>
  count(split, y)
```

## Fit a classifier

The classifier is deliberately simple. The goal is not to optimize predictive performance, but to produce probabilities that can be evaluated and calibrated.

```{r classifier}
train <- iris_binary |> filter(split == "train")
calibration <- iris_binary |> filter(split == "calibration")
test <- iris_binary |> filter(split == "test")

# Logistic regression fitted on the training set only
classifier <- glm(
  y ~ Sepal.Length + Sepal.Width,
  data = train,
  family = binomial()
)

# Uncalibrated probabilities for the calibration and test sets
calibration <- calibration |>
  mutate(raw_p = predict(classifier, calibration, type = "response"))

test <- test |>
  mutate(raw_p = predict(classifier, test, type = "response"))
```

## Fit calibrators

Here we fit two calibrators on the calibration set. `cal_beta()` works directly on probabilities. `cal_platt()` can be used on raw probabilities or scores.

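
As a rough illustration of what a Platt-style calibrator does, the sketch below uses base R only and simulated data, not `calibratr`. It assumes the probabilities are first moved to the logit scale and then passed through a logistic regression; whether `cal_platt()` applies the same transformation internally is not stated here. All object names (`sketch_p`, `sketch_y`, `platt_sketch`) are made up for this example.

```{r platt-sketch}
# Simulated data for illustration only: the "raw" probabilities are
# deliberately too extreme because their logits are doubled.
set.seed(42)
n <- 200
true_logit <- rnorm(n)
sketch_y <- rbinom(n, size = 1, prob = plogis(true_logit))
sketch_p <- plogis(2 * true_logit)

# Assumption: work on the logit scale, clipping to avoid infinite values
eps <- 1e-6
sketch_score <- qlogis(pmin(pmax(sketch_p, eps), 1 - eps))

# Platt-style recalibration: logistic regression of the outcome on the score
platt_sketch <- glm(sketch_y ~ sketch_score, family = binomial())
coef(platt_sketch)

# Recalibrated probabilities for the same observations
sketch_cal <- predict(platt_sketch, type = "response")
range(sketch_cal)
```

A fitted slope below 1 pulls extreme probabilities back toward the middle, which is the correction one would expect for these simulated inputs.
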
```{r calibrators}
# Both calibrators are fitted on the calibration set only
beta_fit <- cal_beta(calibration$raw_p, calibration$y)
platt_fit <- cal_platt(calibration$raw_p, calibration$y)

# Apply the fitted calibrators to the held-out test probabilities
test <- test |>
  mutate(
    beta = predict(beta_fit, raw_p),
    platt = predict(platt_fit, raw_p)
  )
```

## Compare calibration metrics

Calibration metrics are computed only on the test set.

```{r metrics}
metric_table <- bind_rows(
  test |>
    summarise(
      method = "raw",
      ece = ece(raw_p, y, bins = 5),
      mce = mce(raw_p, y, bins = 5),
      ace = ace(raw_p, y, bins = 5)
    ),
  test |>
    summarise(
      method = "beta",
      ece = ece(beta, y, bins = 5),
      mce = mce(beta, y, bins = 5),
      ace = ace(beta, y, bins = 5)
    ),
  test |>
    summarise(
      method = "platt",
      ece = ece(platt, y, bins = 5),
      mce = mce(platt, y, bins = 5),
      ace = ace(platt, y, bins = 5)
    )
) |>
  mutate(across(where(is.numeric), function(x) round(x, 3)))

metric_table
```

The best method is data dependent, and with only 26 test rows spread over 5 bins these estimates are noisy. Choose a calibrator using a validation criterion that matches the intended use of the probabilities.

## Plot the calibrated probabilities

```{r diagram, fig.width = 6, fig.height = 5, fig.alt = "Reliability diagram for beta-calibrated iris probabilities, with binned points compared to the diagonal line."}
reliability_diagram(test$beta, test$y, bins = 5)
```

The diagonal represents perfect calibration. Points above the diagonal indicate bins where the observed event frequency is higher than the mean predicted probability, so the model under-predicts the event there. Points below the diagonal indicate bins where the observed event frequency is lower than the mean predicted probability, so the model over-predicts the event there.
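
The binned points in a reliability diagram and a binned calibration error rest on the same per-bin summary: the mean predicted probability and the observed event frequency. The base R sketch below shows one way to compute that summary and a weighted absolute gap from it. It assumes equal-width bins on [0, 1]; `bin_summary()` and the `toy_*` objects are hypothetical names for this example, and the binning used by `ece()` or `reliability_diagram()` may differ.

```{r ece-sketch}
# Hypothetical helper for illustration; not part of calibratr.
bin_summary <- function(p, y, bins = 5) {
  # Equal-width bins on [0, 1]; include.lowest keeps p == 0 in the first bin
  bin <- cut(p, breaks = seq(0, 1, length.out = bins + 1), include.lowest = TRUE)
  out <- data.frame(
    n = as.vector(tapply(y, bin, length)),
    mean_p = as.vector(tapply(p, bin, mean)),
    obs_freq = as.vector(tapply(y, bin, mean))
  )
  # Empty bins produce NA counts; drop them
  out[!is.na(out$n), ]
}

# Toy predictions and outcomes, chosen by hand
toy_p <- c(0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9)
toy_y <- c(0, 0, 0, 1, 1, 1, 1, 0)

toy_bins <- bin_summary(toy_p, toy_y, bins = 5)
toy_bins

# Bin-size-weighted mean absolute gap between observed and predicted
toy_ece <- sum(toy_bins$n / sum(toy_bins$n) * abs(toy_bins$obs_freq - toy_bins$mean_p))
toy_ece
```

In this toy data, rows with `obs_freq` greater than `mean_p` would plot above the diagonal and the remaining rows below it.
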