| Title: | Calibration of Binary and Multiclass Probabilities |
|---|---|
| Description: | Provides S3 calibrators, metrics, and diagnostics for binary and multiclass probability calibration in R. Binary methods include Platt scaling, temperature scaling, beta calibration, histogram binning, and isotonic regression. Multiclass methods include temperature scaling, vector scaling, Dirichlet calibration, and a one-vs-rest wrapper for the binary calibrators. Methods follow Platt (1999), Zadrozny and Elkan (2002) <doi:10.1145/775047.775151>, Guo et al. (2017), Kull et al. (2017) <doi:10.1214/17-EJS1338SI>, and Kull et al. (2019). |
| Authors: | Pedro Rafael Diniz Marinho [aut, cre] (ORCID: <https://orcid.org/0000-0003-1591-8300>) |
| Maintainer: | Pedro Rafael Diniz Marinho <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-06-23 16:32:20 UTC |
| Source: | https://github.com/prdm0/calibratr |
ace() returns the empirical unweighted mean absolute calibration gap over
non-empty equal-width bins. Unlike ece(), each non-empty bin contributes
equally. For multiclass inputs the "classwise" form averages the binary ACE
over the one-vs-rest columns and the "confidence" form uses the top-label
confidence.
ace(p, y, bins = 10, type = c("classwise", "confidence"))
p |
Predicted probabilities. A numeric vector in |
y |
Outcome labels. A vector coded as |
bins |
Number of equal-width bins on |
type |
Multiclass aggregation, either |
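Using the same bin notation and endpoint convention as ece(), let $n_b$ be the
number of observations in bin $b$, $\bar{p}_b$ the mean predicted probability
in that bin, $\bar{y}_b$ its empirical event frequency, and $B_+$ the number of
non-empty bins. The binary empirical average calibration error is

$$\mathrm{ACE} = \frac{1}{B_+} \sum_{b:\, n_b > 0} \left| \bar{y}_b - \bar{p}_b \right|.$$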
Unlike ECE, ACE does not weight bins by their sample sizes. Sparse bins and
dense bins therefore contribute equally once they are non-empty. This
implementation uses equal-width bins on [0, 1]; it does not construct
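adaptive or equal-frequency bins. For a multiclass probability matrix with
$K$ classes, type = "classwise" returns the arithmetic mean of the
one-vs-rest binary ACE values,

$$\mathrm{ACE}_{\mathrm{classwise}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{ACE}\left(p_{\cdot k},\, \mathbb{1}\{y = k\}\right),$$

and type = "confidence" returns

$$\mathrm{ACE}_{\mathrm{confidence}} = \mathrm{ACE}(c, z), \qquad c_i = \max_k p_{ik}, \qquad z_i = \mathbb{1}\{\arg\max_k p_{ik} = y_i\},$$

using top-label confidence and correctness.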
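The binary definition can be sketched in base R. This is a minimal illustration, assuming left-closed equal-width bins with the value 1 assigned to the last bin; the helper name ace_sketch is hypothetical and is not the package's implementation.

```r
# Hypothetical helper: unweighted mean of |event rate - mean confidence|
# over the non-empty equal-width bins.
ace_sketch <- function(p, y, bins = 10) {
  # Left-closed bins; p == 1 is assigned to the last bin.
  idx <- pmin(floor(p * bins) + 1, bins)
  conf <- tapply(p, idx, mean)  # mean predicted probability per non-empty bin
  acc <- tapply(y, idx, mean)   # empirical event frequency per non-empty bin
  mean(abs(acc - conf))         # every non-empty bin has equal weight
}

p <- c(0.10, 0.20, 0.80, 0.90)
y <- c(0, 0, 1, 1)
ace_value <- ace_sketch(p, y, bins = 2)
```

Because tapply() only returns groups that occur in the data, empty bins drop out of the average automatically.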
A single numeric value.
Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. Proceedings of the 22nd International Conference on Machine Learning.
predictions <- data.frame(
  p = c(0.10, 0.20, 0.80, 0.90),
  y = c(0, 0, 1, 1)
)
predictions |>
  dplyr::summarise(ace = ace(p, y, bins = 2))
cal_beta() fits the beta calibration model
inv_logit(a * log(p) - b * log(1 - p) + c). Probabilities are clipped to the
interval [eps, 1 - eps] before taking logarithms.
cal_beta(p, y, eps = 1e-15)
p |
Numeric vector of uncalibrated probabilities in |
y |
Binary outcome vector coded as |
eps |
Clipping constant satisfying |
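Beta calibration treats the uncalibrated event probability $p_i$ through
two log-transformed features. Before the transformation, probabilities are
clipped by

$$\tilde{p}_i = \min\{\max(p_i, \varepsilon),\, 1 - \varepsilon\}.$$

The calibrated probability is

$$\hat{p}_i = \operatorname{inv\_logit}\left(a \log \tilde{p}_i - b \log(1 - \tilde{p}_i) + c\right).$$

The implementation fits an ordinary unpenalized binomial glm() with the
original binary labels, without Platt target correction. Its linear
predictor is

$$\eta_i = \beta_0 + \beta_1 \log \tilde{p}_i + \beta_2 \log(1 - \tilde{p}_i).$$

Equivalently, the fitted coefficients minimize the binomial cross-entropy

$$-\sum_{i=1}^{n} \left[ y_i \log \operatorname{inv\_logit}(\eta_i) + (1 - y_i) \log\left(1 - \operatorname{inv\_logit}(\eta_i)\right) \right].$$

The beta-calibration parameters are the following reparameterization of the
fitted glm() coefficients:

$$a = \beta_1, \qquad b = -\beta_2, \qquad c = \beta_0.$$

Thus prediction first computes the clipped probability $\tilde{p}$ and then
evaluates $\operatorname{inv\_logit}\left(a \log \tilde{p} - b \log(1 - \tilde{p}) + c\right)$.
The object element coefficients contains $(\beta_0, \beta_1, \beta_2)$
from glm(), while a, b, and c contain the
reparameterized beta-calibration coefficients. Since

$$\frac{d}{dp}\left[a \log p - b \log(1 - p) + c\right] = \frac{a}{p} + \frac{b}{1 - p},$$

monotone increase on (0, 1) is guaranteed when $a \ge 0$ and $b \ge 0$ with
at least one of them strictly positive. The implementation does
not impose these constraints.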
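The fit described above can be sketched with stats::glm() alone. This is an illustration under assumptions: the feature order (log(p), then log(1 - p)) and the sign convention b = -beta2 are chosen to match the formula inv_logit(a * log(p) - b * log(1 - p) + c), and the helper calibrate_beta is hypothetical.

```r
set.seed(3)
p <- stats::rbeta(200, 2, 2)
y <- stats::rbinom(200, 1, p)

# Clip before taking logarithms.
eps <- 1e-15
p_clip <- pmin(pmax(p, eps), 1 - eps)

# Ordinary unpenalized binomial GLM on the two log features.
fit <- stats::glm(y ~ log(p_clip) + log1p(-p_clip), family = stats::binomial())
beta <- unname(stats::coef(fit))

# Assumed reparameterization: a = beta1, b = -beta2, c = beta0.
a <- beta[2]
b <- -beta[3]
c0 <- beta[1]

# Hypothetical prediction helper: clip, then apply the beta-calibration map.
calibrate_beta <- function(p_new) {
  q <- pmin(pmax(p_new, eps), 1 - eps)
  stats::plogis(a * log(q) - b * log1p(-q) + c0)
}

# Signs of a and b show whether the fitted map is monotone increasing.
c(a = a, b = b, c = c0)
```

On the training probabilities, calibrate_beta(p) reproduces the fitted values of the GLM, which is a quick check that the reparameterization is consistent.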
A cal_beta object. Use predict() with new probabilities to obtain
calibrated probabilities.
Kull, M., Silva Filho, T. M., & Flach, P. (2017). Beta calibration: A well-founded and easily implemented improvement on logistic calibration for binary classifiers. Electronic Journal of Statistics, 11(2), 5052-5080. https://doi.org/10.1214/17-EJS1338SI.
set.seed(3)
calibration <- data.frame(raw_p = stats::rbeta(120, 2, 2)) |>
  dplyr::mutate(y = rbinom(dplyr::n(), 1, raw_p))
fit <- cal_beta(calibration$raw_p, calibration$y)
calibration |>
  dplyr::mutate(calibrated = predict(fit, raw_p)) |>
  dplyr::summarise(
    raw_ece = ece(raw_p, y, bins = 10),
    calibrated_ece = ece(calibrated, y, bins = 10)
  )
cal_cv() fits a calibrator with out-of-fold predictions. The function
expects scores, probabilities, or logits that were already produced by a
model. It does not train the underlying classifier.
cal_cv(
  x,
  y,
  method = c("platt", "temperature", "beta", "isotonic", "histogram", "vector", "dirichlet", "ovr"),
  folds = 5,
  seed = NULL,
  ...
)
x |
Numeric vector of uncalibrated values for binary calibration, or a
numeric matrix with one column per class for multiclass calibration. Use
logits for |
y |
Binary outcome vector coded as |
method |
Calibration method. |
folds |
Number of stratified folds. Must be a single integer at least
|
seed |
Optional integer seed used only for fold assignment. |
... |
Additional arguments passed to the selected calibrator, such as
|
Folds are stratified by the outcome. The returned object stores the
out-of-fold calibrated probabilities and a final calibrator fitted on all
observations for future prediction. Binary and multiclass problems are
handled through the type of x. A numeric vector triggers binary
calibration. A numeric matrix with one column per class triggers multiclass
calibration, the out-of-fold predictions become a matrix, and the available
methods are "temperature", "vector", "dirichlet", and "ovr". For
method = "ovr", pass the binary method through base_method.
Cross-validated calibration estimates how the calibration map behaves on
observations not used to fit that map. Let $f(i) \in \{1, \dots, V\}$
denote the fold assigned to observation $i$. For each fold $v$, a
calibrator $\hat{g}^{(-v)}$ is fitted using the observations $j$ with
$f(j) \neq v$. The out-of-fold calibrated prediction for an observation $i$ in
fold $v$ is then

$$\hat{p}_i^{\mathrm{oof}} = \hat{g}^{(-v)}(x_i).$$

These out-of-fold predictions are stored in oof_predictions and are useful
for estimating calibration metrics without evaluating a calibrator on the
same observations used to fit it. In binary calibration,
$\hat{p}_i^{\mathrm{oof}}$ is a scalar event probability.
In multiclass calibration, it is the row vector
$(\hat{p}_{i1}^{\mathrm{oof}}, \dots, \hat{p}_{iK}^{\mathrm{oof}})$ on the
probability simplex. After the out-of-fold predictions are computed, a final
calibrator is fitted on all observations. The S3 predict()
method for a cal_cv object uses this final calibrator for future data.
The folds are stratified by the observed labels. Setting seed affects only
the fold assignment and restores the previous random-number state after the
assignment is made. The function assumes that x already contains model
outputs from another classifier; it does not refit that classifier inside
each fold. Thus the predictions are out of fold for the calibration map only,
unless x itself was produced out of fold by the underlying classifier.
folds must be at least 2 and no larger than the smallest class count.
Within each class, observations are randomly permuted and assigned fold
labels in sequence. For
multiclass inputs, column $k$ corresponds to integer class code $k$;
if y is a factor, column $k$ corresponds to levels(y)[k]. For
method = "ovr", base_method is read from ...; if it is not supplied,
the default base method is "platt".
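The fold assignment and the out-of-fold loop can be sketched in base R. This is an illustration only: the helper stratified_folds is hypothetical, and a plain logistic regression stands in for whichever calibrator is selected by method.

```r
# Hypothetical helper: permute within each class, then assign fold labels in sequence.
stratified_folds <- function(y, folds) {
  fold_id <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    idx <- idx[sample.int(length(idx))]                   # random permutation within class
    fold_id[idx] <- rep_len(seq_len(folds), length(idx))  # labels 1, 2, ..., folds, 1, ...
  }
  fold_id
}

set.seed(7)
x <- stats::runif(120)
y <- stats::rbinom(120, 1, x)
fold_id <- stratified_folds(y, folds = 3)

# Out-of-fold predictions: each observation is scored by a calibrator
# that was fitted without its own fold.
oof <- numeric(length(x))
for (v in 1:3) {
  train <- fold_id != v
  fit_v <- stats::glm(y ~ x, family = stats::binomial(),
                      data = data.frame(x = x[train], y = y[train]))
  oof[!train] <- stats::predict(fit_v, newdata = data.frame(x = x[!train]),
                                type = "response")
}
```

Assigning labels in sequence after the permutation keeps the per-class fold sizes within one observation of each other, which is why folds cannot exceed the smallest class count.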
A cal_cv object. Use predict() to apply the final calibrator to
new values. The object stores fold_id, oof_predictions,
fold_calibrators, and final_calibrator. For binary calibration,
oof_predictions is a numeric vector. For multiclass calibration, it is a
numeric matrix with one row per observation and one column per class, with
column names given by the class levels.
set.seed(7)
predictions <- data.frame(raw_p = stats::runif(120)) |>
  dplyr::mutate(y = rbinom(dplyr::n(), 1, raw_p))
fit <- cal_cv(
  predictions$raw_p,
  predictions$y,
  method = "histogram",
  folds = 3,
  bins = 5,
  seed = 1
)
predictions |>
  dplyr::mutate(calibrated = fit$oof_predictions) |>
  dplyr::summarise(ece = ece(calibrated, y, bins = 5))
cal_dirichlet() is the multiclass generalization of beta calibration. It
fits a linear map on the log of the predicted probabilities followed by a
softmax, which is equivalent to a multinomial logistic regression with the
log-probabilities as features. An off-diagonal and intercept regularization
(ODIR) penalty shrinks the off-diagonal weights and the intercepts toward
zero, which reduces overfitting risk when the number of classes is large.
cal_dirichlet(p, y, lambda = NULL, eps = 1e-12)
p |
Numeric matrix of uncalibrated probabilities with one row per
observation and one column per class. Rows must sum to one within absolute
tolerance |
y |
A factor or a vector of integer class codes in |
lambda |
Non-negative ODIR regularization strength. When |
eps |
Clipping constant satisfying |
The calibrated probabilities are computed row-wise as
softmax(log(p) %*% t(W) + b), where W is a K by K weight matrix and
b is a length K intercept vector. Probabilities are clipped to the
interval [eps, 1 - eps] before taking logarithms.
When lambda is NULL, it is selected from a small deterministic grid by
cross-validated log-likelihood.
Let $p_{ik}$ be the uncalibrated probability assigned to class $k$
for observation $i$. Each row of p must sum to one within absolute
tolerance 1e-6. Column $k$ corresponds to integer class code $k$;
if y is a factor, column $k$ corresponds to levels(y)[k]. The
entries are clipped elementwise by

$$\tilde{p}_{ik} = \min\{\max(p_{ik}, \varepsilon),\, 1 - \varepsilon\}$$

and transformed to $x_{ik} = \log \tilde{p}_{ik}$. The clipped feature matrix
is not renormalized; normalization occurs only after the linear map, through
the final softmax. Dirichlet calibration fits a multinomial logistic
regression on these log-probability features,

$$z_{ik} = \sum_{j=1}^{K} W_{kj}\, x_{ij} + b_k,$$

followed by

$$\hat{p}_{ik} = \frac{\exp(z_{ik})}{\sum_{l=1}^{K} \exp(z_{il})}.$$

With fixed $\lambda$, the fitted parameters $(W, b)$ minimize the negative
log-likelihood of the observed labels plus $\lambda$ times a penalty on the
off-diagonal weights $W_{kj}$, $k \neq j$, and the intercepts $b_k$.
This is the off-diagonal and intercept regularization penalty. Diagonal
weights are not penalized. For fixed lambda, optimization uses BFGS with
analytic gradients, fixed initial values for the weight matrix and bias,
and maxit = 500. True-class probabilities
entering logarithms are clipped to [1e-15, 1 - 1e-15]. The returned
weight is a $K \times K$ matrix whose row $k$ produces
the logit for class $k$; bias is a length-$K$ vector of
intercepts. The object also stores lambda, value, and the optimizer
convergence code.
If lambda = NULL, the implementation evaluates the grid
c(0, 1e-4, 1e-3, 1e-2, 1e-1) with at most three deterministic stratified
folds. Class indices are assigned to folds in their existing order. The
selected value minimizes the unweighted average of the fold mean held-out
negative log-likelihoods; ties choose the first grid value. If fewer than two
observations are available in the smallest class during selection, the
fallback value is 1e-3. With lambda = 0, the multinomial softmax
parameterization is not unique: adding the same linear function of the
features to every class logit leaves all probabilities unchanged. The
calibrated probabilities are the identified output.
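The prediction map and the non-uniqueness at lambda = 0 can be illustrated in base R. This sketch covers only the forward map softmax(log(p) %*% t(W) + b), not the penalized fit; the helpers softmax_rows and dirichlet_map are hypothetical names.

```r
# Row-wise softmax with the row maximum subtracted for numerical stability.
softmax_rows <- function(z) {
  z <- z - apply(z, 1, max)
  ez <- exp(z)
  ez / rowSums(ez)
}

# Hypothetical forward map: clip, take logs, apply the linear map, then softmax.
dirichlet_map <- function(p, W, b, eps = 1e-12) {
  x <- log(pmin(pmax(p, eps), 1 - eps))
  softmax_rows(sweep(x %*% t(W), 2, b, "+"))
}

set.seed(23)
prob <- matrix(stats::runif(5 * 3), ncol = 3)
prob <- prob / rowSums(prob)

# Identity weights and zero bias leave normalized probabilities unchanged.
identity_out <- dirichlet_map(prob, diag(3), rep(0, 3))

# Adding the same vector u to every row of W and the same constant to every
# intercept adds one common term to all class logits, so the output is unchanged.
W <- matrix(c(1.2, 0.1, 0.0,
              0.0, 0.9, 0.2,
              0.1, 0.0, 1.1), nrow = 3, byrow = TRUE)
b <- c(0.3, -0.1, 0.0)
u <- c(0.5, -0.2, 0.1)
out_a <- dirichlet_map(prob, W, b)
out_b <- dirichlet_map(prob, W + matrix(u, 3, 3, byrow = TRUE), b + 0.7)
```

Here out_a and out_b agree up to floating-point error, which is why only the calibrated probabilities, and not W and b themselves, are identified without a penalty.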
A cal_dirichlet object that also inherits from cal_multiclass.
Use predict() with new probabilities to obtain calibrated probabilities.
Kull, M., Perello-Nieto, M., Kängsepp, M., Silva Filho, T., Song, H., & Flach, P. (2019). Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. Advances in Neural Information Processing Systems 32.
set.seed(23)
prob <- matrix(stats::runif(200 * 3), ncol = 3)
prob <- prob / rowSums(prob)
labels <- max.col(prob)
fit <- cal_dirichlet(prob, labels)
head(predict(fit, prob))
cal_histogram() partitions [0, 1] into bins and replaces each probability
with the empirical event frequency in its bin. Equal-width bins use fixed
intervals. Equal-frequency bins use sample quantiles as break points.
cal_histogram(p, y, bins = 10, strategy = c("equal_width", "equal_freq"))
p |
Numeric vector of uncalibrated probabilities in |
y |
Binary outcome vector coded as |
bins |
Number of bins. Must be a single positive integer. |
strategy |
Binning strategy. Use |
Empty training bins inherit the empirical rate from the nearest non-empty
bin. This makes prediction defined over the whole interval [0, 1].
Histogram binning estimates a piecewise constant calibration map. Given
distinct break points
$0 = t_0 < t_1 < \dots < t_M = 1$,
the implementation uses left-closed bins. For $m = 1, \dots, M - 1$,

$$I_m = [t_{m-1},\, t_m),$$

and the last bin is

$$I_M = [t_{M-1},\, t_M].$$

The fitted value for a non-empty bin is the empirical event frequency,

$$\hat{v}_m = \frac{1}{n_m} \sum_{i:\, p_i \in I_m} y_i, \qquad n_m = \left|\{i : p_i \in I_m\}\right|.$$

A new probability receives the fitted value of the bin into which it falls.
Values exactly on an internal break point are assigned to the bin that starts
at that break point; the value 1 is assigned to the last bin.
With strategy = "equal_width", the break points are equally spaced on
[0, 1], so $t_m = m / B$ for $m = 0, \dots, B$ when bins = B. With
strategy = "equal_freq", provisional break points are

$$t_m = Q(m / B), \qquad m = 0, \dots, B,$$

where $Q$ is the sample quantile computed by
stats::quantile(type = 8). The first and last break points are then forced
to 0 and 1. Duplicated break points are removed, so the actual number of
bins can be smaller than bins. Empty bins are assigned the value of
the nearest non-empty bin by bin index; if an empty bin is equally close to
two non-empty bins, the lower-index non-empty bin is used. If no non-empty
bin is available, the global event rate is used as a fallback.
The returned object stores the requested bins, the realized actual_bins,
strategy, breaks, per-bin fitted values in bin_values, training
counts, global_rate, and the original call.
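The equal-width case can be sketched with findInterval(). This is a simplified illustration, not the package's code; the helpers bin_of and predict_hist are hypothetical, and the empty-bin rule follows the description above.

```r
set.seed(5)
p <- stats::runif(120)
y <- stats::rbinom(120, 1, p)

bins <- 5
breaks <- seq(0, 1, length.out = bins + 1)

# Left-closed bins; rightmost.closed = TRUE assigns p == 1 to the last bin.
bin_of <- function(p) findInterval(p, breaks, rightmost.closed = TRUE)

counts <- tabulate(bin_of(p), nbins = bins)
events <- tabulate(bin_of(p)[y == 1], nbins = bins)
bin_values <- events / counts  # NaN where a bin is empty

# Empty bins take the value of the nearest non-empty bin by index;
# which.min() returns the first minimum, so ties go to the lower index.
non_empty <- which(counts > 0)
for (m in which(counts == 0)) {
  bin_values[m] <- bin_values[non_empty[which.min(abs(non_empty - m))]]
}

# Hypothetical prediction helper: look up the fitted value of the bin.
predict_hist <- function(p_new) bin_values[bin_of(p_new)]
predict_hist(c(0, 0.2, 0.55, 1))
```

A value exactly on an internal break point, such as 0.2 here, is assigned to the bin that starts at that point.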
A cal_histogram object. Use predict() with new probabilities to
obtain calibrated probabilities.
Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/775047.775151.
set.seed(5)
calibration <- data.frame(raw_p = stats::runif(120)) |>
  dplyr::mutate(y = rbinom(dplyr::n(), 1, raw_p))
fit <- cal_histogram(calibration$raw_p, calibration$y, bins = 5)
calibration |>
  dplyr::mutate(calibrated = predict(fit, raw_p)) |>
  dplyr::summarise(
    raw_ece = ece(raw_p, y, bins = 5),
    calibrated_ece = ece(calibrated, y, bins = 5)
  )
cal_isotonic() fits a monotone calibration curve with stats::isoreg().
New probabilities are calibrated by linear interpolation. Predictions below
the training range use the leftmost fitted value; predictions above the range
use the rightmost fitted value.
cal_isotonic(p, y)
p |
Numeric vector of uncalibrated probabilities in |
y |
Binary outcome vector coded as |
Ties in the training probabilities are ordered with positive labels first before isotonic regression and then collapsed to a single fitted value per unique probability.
Isotonic calibration estimates a nondecreasing function that maps raw
probabilities to calibrated event probabilities. Let $\pi$ be the
ordering that sorts observations by increasing $p_i$ and, for equal
$p_i$, decreasing $y_i$. Thus positive labels precede negative labels
within a tied probability value. The fitted values solve the projection
problem

$$\hat{m} = \arg\min_{m_1 \le m_2 \le \dots \le m_n} \sum_{i=1}^{n} \left( y_{\pi(i)} - m_i \right)^2.$$
The implementation uses stats::isoreg() for the constrained least-squares
problem and clips the fitted values to [0, 1]. The label vector must
contain at least one 0 and one 1.
Prediction uses linear interpolation between the unique training
probabilities and their fitted values. If a new probability is below the
smallest training value, prediction returns the leftmost fitted value. If it
is above the largest training value, prediction returns the rightmost fitted
value. Training ties are collapsed to one fitted value per unique probability
after the isotonic fit by averaging the fitted values within each tied group.
If the training data contain a single unique probability, prediction is the
resulting constant fitted value. The fitted object stores the unique
probabilities in x_thresholds, the collapsed fitted values in
y_calibrated, the stats::isoreg() object in fit, and the original call.
Prediction uses stats::approx(method = "linear") with constant
extrapolation at the two endpoints, so the package prediction rule is the
interpolated monotone curve rather than the unmodified PAVA step function.
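The fit, tie collapsing, and interpolation steps can be sketched with stats::isoreg() and stats::approx(). This is an illustration under the rules stated above; the names x_thresholds and y_calibrated mirror the stored elements, while predict_iso is a hypothetical helper.

```r
set.seed(4)
p <- round(stats::runif(120), 2)  # rounding creates tied probabilities
y <- stats::rbinom(120, 1, p)

# Sort by increasing p and, within ties, positive labels first.
ord <- order(p, -y)
x_sorted <- p[ord]

# Isotonic regression on the labels in that order (PAVA), clipped to [0, 1].
iso <- stats::isoreg(y[ord])
y_fit <- pmin(pmax(iso$yf, 0), 1)

# Collapse ties: one averaged fitted value per unique probability.
x_thresholds <- unique(x_sorted)
y_calibrated <- as.numeric(tapply(y_fit, x_sorted, mean))

# Linear interpolation with constant extrapolation at both ends (rule = 2).
predict_iso <- function(p_new) {
  stats::approx(x_thresholds, y_calibrated, xout = p_new,
                method = "linear", rule = 2)$y
}
predict_iso(c(0, 0.25, 0.5, 0.75, 1))
```

Averaging within tied groups preserves monotonicity because the groups are consecutive blocks of a nondecreasing sequence.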
A cal_isotonic object. Use predict() with new probabilities to
obtain calibrated probabilities.
Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/775047.775151.
set.seed(4)
calibration <- data.frame(raw_p = sort(stats::runif(120))) |>
  dplyr::mutate(y = rbinom(dplyr::n(), 1, raw_p))
fit <- cal_isotonic(calibration$raw_p, calibration$y)
calibration |>
  dplyr::mutate(calibrated = predict(fit, raw_p)) |>
  dplyr::summarise(
    raw_ece = ece(raw_p, y, bins = 10),
    calibrated_ece = ece(calibrated, y, bins = 10)
  )
cal_ovr() extends any binary calibrator to a multiclass problem with the
one-vs-rest reduction. For each class it fits a binary calibrator that
separates that class from the others, applies the calibrators column by
column, and renormalizes each row to sum to one. This is the default strategy
that binning methods use for multiclass calibration.
cal_ovr(
  x,
  y,
  method = c("platt", "beta", "isotonic", "histogram", "temperature"),
  ...
)
x |
Numeric matrix of uncalibrated values with one row per observation
and one column per class. For |
y |
A factor or a vector of integer class codes in |
method |
Binary calibrator applied to each one-vs-rest problem. |
... |
Additional arguments passed to the binary calibrator, such as
|
The columns of x are the per-class uncalibrated values. Use scores or
probabilities for method = "platt", probabilities for "beta",
"isotonic", and "histogram", and binary one-vs-rest logits for
"temperature". Rows of x are not required to sum to one. Every class
must appear at least once in y, because each one-vs-rest problem needs both
labels.
For $K$ classes, column $k$ of x corresponds to integer class code
$k$; if y is a factor, column $k$ corresponds to levels(y)[k].
One-vs-rest calibration creates $K$ binary labels,

$$y_i^{(k)} = \mathbb{1}\{y_i = k\}, \qquad k = 1, \dots, K.$$

A separate binary calibrator $\hat{g}_k$ is fitted to column $k$ of x and
the binary labels $y^{(k)}$. On new data, the classwise calibrated
scores are

$$q_{ik} = \hat{g}_k(x_{ik}).$$

Because the binary calibrators are fitted independently, the row sums
of $q$ need not equal one. Let
$s_i = \sum_{k=1}^{K} q_{ik}$. If
$s_i$ is finite and positive, the final multiclass probabilities are
renormalized by row,

$$\hat{p}_{ik} = \frac{q_{ik}}{s_i}.$$

If $s_i$ is zero or non-finite, the prediction for that row is replaced
by the uniform distribution $(1/K, \dots, 1/K)$. This fallback
keeps the output on the probability simplex. The renormalization changes the
individual values unless $s_i = 1$, so final
columns should not be interpreted as the raw outputs of the independently
calibrated binary problems. The renormalized probabilities are
simplex-valued, but the one-vs-rest reduction does not by itself guarantee
joint multiclass calibration.
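The renormalization step and its uniform fallback can be sketched as follows. The helper renormalize_rows is a hypothetical name for illustration, and the matrix q stands for classwise scores that have already been calibrated independently.

```r
# Hypothetical helper: divide each row by its sum, falling back to the
# uniform distribution when the sum is zero or non-finite.
renormalize_rows <- function(q) {
  s <- rowSums(q)
  ok <- is.finite(s) & s > 0
  out <- matrix(1 / ncol(q), nrow(q), ncol(q))  # uniform fallback for every row
  out[ok, ] <- q[ok, , drop = FALSE] / s[ok]    # overwrite rows with a valid sum
  out
}

q <- rbind(
  c(0.6, 0.3, 0.3),  # sums to 1.2, so every entry is rescaled
  c(0.0, 0.0, 0.0)   # sums to 0, so the row becomes uniform
)
p_hat <- renormalize_rows(q)
```

The first row becomes (0.5, 0.25, 0.25), which shows how renormalization moves each value away from the raw one-vs-rest output whenever the row sum differs from one.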
A cal_ovr object that also inherits from cal_multiclass. The
object stores calibrators, base_method, k, levels, input, and the
original call. Use predict() with a new score matrix to obtain a numeric
matrix of calibrated probabilities whose rows sum to one.
Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/775047.775151.
set.seed(21)
raw <- matrix(stats::runif(150 * 3), ncol = 3)
raw <- raw / rowSums(raw)
labels <- max.col(raw)
fit <- cal_ovr(raw, labels, method = "isotonic")
calibrated <- predict(fit, raw)
head(calibrated)
cal_platt() fits a logistic regression that maps an uncalibrated score to
a calibrated probability. The binary targets are adjusted with Platt's target
correction before fitting, which shrinks labels away from exact 0 and 1.
cal_platt(x, y)
x |
Numeric vector of uncalibrated scores or raw probabilities. |
y |
Binary outcome vector coded as |
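Let $\{(x_i, y_i)\}_{i=1}^{n}$
be the calibration sample, where $x_i$ is the supplied score and
$y_i \in \{0, 1\}$ is the observed label. Write
$N_+ = \sum_{i=1}^{n} y_i$ and
$N_- = n - N_+$. Platt's correction replaces the
binary labels by fractional targets. Positive labels use

$$t_+ = \frac{N_+ + 1}{N_+ + 2}$$

and negative labels use

$$t_- = \frac{1}{N_- + 2}.$$

Thus $t_i = t_+$ when $y_i = 1$
and $t_i = t_-$ when $y_i = 0$. The
fitted logistic map is

$$g(x) = \operatorname{inv\_logit}(\alpha + \beta x).$$

$\alpha$ and $\beta$ are estimated by minimizing the
binomial cross-entropy with the corrected fractional targets,

$$-\sum_{i=1}^{n} \left[ t_i \log g(x_i) + (1 - t_i) \log\left(1 - g(x_i)\right) \right].$$

The implementation fits this model with stats::glm() using the formula
y_adj ~ x. The label vector must contain at least one 0 and one 1.
The returned object stores coefficients, where (Intercept) is
$\hat{\alpha}$ and x is $\hat{\beta}$, as
well as the full glm object in fit and the corrected targets
target_pos and target_neg. Prediction applies
$\operatorname{inv\_logit}(\hat{\alpha} + \hat{\beta} x)$ to new scores. The argument x may
be a score on any real-valued scale or a raw probability, but the fitted map
is always a logistic function of the supplied values. The slope is
unconstrained; the fitted map is increasing in x only when
$\hat{\beta} > 0$.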
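The target correction and the logistic fit can be sketched in base R. This illustration uses the standard targets from Platt (1999); predict_platt is a hypothetical helper, and the warning suppression is a choice made for this sketch only.

```r
set.seed(1)
x <- stats::rnorm(120)
y <- stats::rbinom(120, 1, stats::plogis(x))

# Platt's fractional targets shrink the labels away from exact 0 and 1.
n_pos <- sum(y == 1)
n_neg <- sum(y == 0)
t_pos <- (n_pos + 1) / (n_pos + 2)
t_neg <- 1 / (n_neg + 2)
y_adj <- ifelse(y == 1, t_pos, t_neg)

# A binomial glm() with fractional responses warns about non-integer
# successes; the cross-entropy fit itself is still well defined.
fit <- suppressWarnings(stats::glm(y_adj ~ x, family = stats::binomial()))
coefs <- stats::coef(fit)

# Hypothetical prediction helper: logistic function of the supplied score.
predict_platt <- function(x_new) {
  stats::plogis(coefs[["(Intercept)"]] + coefs[["x"]] * x_new)
}
predict_platt(c(-2, 0, 2))
```

Checking the sign of coefs[["x"]] after the fit is a quick way to confirm that the calibration map is increasing in the score.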
A cal_platt object. Use predict() with new scores to obtain
calibrated probabilities.
Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers.
set.seed(1)
calibration <- data.frame(score = rnorm(120)) |>
  dplyr::mutate(
    truth = inv_logit(score),
    y = rbinom(dplyr::n(), 1, truth)
  )
fit <- cal_platt(calibration$score, calibration$y)
calibration |>
  dplyr::mutate(calibrated = predict(fit, score)) |>
  dplyr::summarise(ece = ece(calibrated, y, bins = 10))
cal_temperature() estimates a single positive temperature parameter by
minimizing the negative log-likelihood. Inputs must be logits, not
probabilities. For binary probabilities, logit() gives the corresponding
logit. For strictly positive multiclass probability rows, the elementwise
logarithm log(p) is a valid softmax logit
representation, up to row-wise additive constants. If probabilities have
zero entries, the user must choose and supply a transformed logit matrix,
such as clipped log-probabilities. cal_temperature() does not accept or
clip probability matrices.
cal_temperature(logits, y)
logits |
For binary calibration, a numeric vector of uncalibrated logits. For multiclass calibration, a numeric matrix of logits with one row per observation and one column per class. |
y |
Outcome labels. For binary calibration, a vector coded as |
The function handles both binary and multiclass problems through the type of
logits. A numeric vector triggers binary temperature scaling and the
calibrated probability is inv_logit(logits / T). A numeric matrix with one
column per class triggers multiclass temperature scaling and the calibrated
probabilities are softmax(logits / T). Because dividing every logit by the
same positive scalar preserves the row ordering and argmax, temperature
scaling leaves the predicted class unchanged apart from existing ties and
only sharpens or softens the probabilities.
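In the binary case, let $z_i$ be an uncalibrated logit. For a positive
temperature $T$, the calibrated event probability is

$$\hat{p}_i(T) = \operatorname{inv\_logit}(z_i / T).$$

The fitted temperature is found by a bounded one-dimensional optimization on
an interval $[T_{\min}, T_{\max}]$:

$$\hat{T} = \arg\min_{T \in [T_{\min},\, T_{\max}]} \; -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i(T) + (1 - y_i) \log\left(1 - \hat{p}_i(T)\right) \right].$$

In the multiclass case, let $z_{ik}$ be the logit for class $k$ and
observation $i$. The calibrated probabilities are

$$\hat{p}_{ik}(T) = \frac{\exp(z_{ik} / T)}{\sum_{l=1}^{K} \exp(z_{il} / T)},$$

and $T$ is chosen by minimizing the average multiclass negative
log-likelihood over the same interval,

$$\hat{T} = \arg\min_{T \in [T_{\min},\, T_{\max}]} \; -\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}_{i, y_i}(T).$$

For multiclass labels, column $k$ of the logit matrix corresponds to
class code $k$. If y is a factor, the stored order of levels(y)
defines the column order. The numerical objective clips probabilities that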
enter logarithms to [1e-15, 1 - 1e-15]. The optimization uses
stats::optim() with method "Brent" and initial value 1 on the bounded
interval above. The returned object stores temperature, the optimizer
value, and the optimizer convergence code; multiclass fits also store
k and levels.
Values $T > 1$ soften the probability vector, while values
$T < 1$ make it more concentrated. Dividing all class logits by the
same positive constant preserves their order, so the predicted class is
unchanged apart from ties already present in the logits.
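The binary fit can be sketched with stats::optim() and method "Brent". This is a minimal illustration: the search bounds 0.05 and 20 are assumptions made for this sketch and are not the package's interval, and the data are simulated so that the logits are too extreme by a factor of 3.

```r
set.seed(2)
z <- 3 * stats::rnorm(200)                        # overconfident logits
y <- stats::rbinom(200, 1, stats::plogis(z / 3))  # labels follow the softened logits

# Average binary negative log-likelihood as a function of the temperature.
nll <- function(temp) {
  p <- pmin(pmax(stats::plogis(z / temp), 1e-15), 1 - 1e-15)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# One-dimensional bounded search; the bounds are illustrative assumptions.
opt <- stats::optim(par = 1, fn = nll, method = "Brent",
                    lower = 0.05, upper = 20)
temperature <- opt$par
calibrated <- stats::plogis(z / temperature)
```

Because the same positive constant divides every logit, the ordering of calibrated matches the ordering of z; only the spread of the probabilities changes.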
A cal_temperature object. Use predict() with new logits to obtain
calibrated probabilities. Multiclass objects also inherit from
cal_multiclass.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. Proceedings of the 34th International Conference on Machine Learning.
set.seed(2)
calibration <- data.frame(logits = rnorm(120)) |>
  dplyr::mutate(
    raw_p = inv_logit(logits),
    y = rbinom(dplyr::n(), 1, raw_p)
  )
fit <- cal_temperature(calibration$logits, calibration$y)
calibration |>
  dplyr::mutate(calibrated = predict(fit, logits)) |>
  dplyr::summarise(
    raw_ece = ece(raw_p, y, bins = 10),
    calibrated_ece = ece(calibrated, y, bins = 10)
  )

# Multiclass temperature scaling with a logit matrix and integer labels.
set.seed(20)
logits <- matrix(rnorm(150 * 3), ncol = 3)
labels <- max.col(logits) # integer codes in 1:3
mc_fit <- cal_temperature(logits, labels)
head(predict(mc_fit, logits))
cal_vector_scaling() is the multiclass generalization of temperature
scaling that gives each class its own scale and bias. It rescales a logit
matrix column by column and applies the softmax. With a single shared scale
and no bias it reduces to temperature scaling, so it is more flexible while
remaining cheap to fit.
cal_vector_scaling(logits, y)
logits |
Numeric matrix of uncalibrated logits with one row per observation and one column per class. |
y |
A factor or a vector of integer class codes in |
The calibrated probabilities are softmax(s * logits + b), where s is a
length K vector of per-class scales applied column by column and b is a
length K vector of per-class biases. Parameters are estimated by minimizing
the average multiclass negative log-likelihood.
Let $z_{ik}$ be the uncalibrated logit for observation $i$ and class
$k$. Vector scaling estimates class-specific scales $s_k$ and
intercepts $b_k$, then forms calibrated logits

$$\tilde{z}_{ik} = s_k z_{ik} + b_k.$$

The predicted probabilities are obtained with the softmax,

$$\hat{p}_{ik} = \frac{\exp(\tilde{z}_{ik})}{\sum_{l=1}^{K} \exp(\tilde{z}_{il})}.$$

Parameters are estimated by minimizing

$$-\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}_{i, y_i}.$$

For multiclass labels, column $k$ of logits corresponds to class code
$k$; if y is a factor, column $k$ corresponds to levels(y)[k].
The implementation uses stats::optim() with method "BFGS", analytic
gradients, fixed initial values for the scales and biases,
and maxit = 500. True-class probabilities entering
logarithms are clipped to [1e-15, 1 - 1e-15]. The returned object stores
scale, bias, the optimized average negative log-likelihood value, and
the optimizer convergence code.
The scales are unconstrained in the fitted optimization, so a negative scale is possible when it improves the likelihood on the calibration data. Unlike temperature scaling, vector scaling can change the predicted class because scales and biases vary by class. As with any softmax model, adding the same constant to every class bias does not change the resulting probability vector, so the fitted bias vector is identifiable only up to a common additive constant.
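The objective and its minimization can be sketched in base R. This is an illustration under assumptions: the start values (unit scales, zero biases) are chosen for this sketch, and optim() falls back to numerical gradients here rather than the analytic gradients described above. Labels are simulated with noise so that the likelihood has a finite optimum.

```r
# Row-wise softmax with the row maximum subtracted for numerical stability.
softmax_rows <- function(z) {
  ez <- exp(z - apply(z, 1, max))
  ez / rowSums(ez)
}

set.seed(22)
n <- 200
K <- 3
logits <- matrix(stats::rnorm(n * K), ncol = K)

# Simulated labels drawn from a rescaled and shifted softmax of the logits.
true_p <- softmax_rows(sweep(0.5 * logits, 2, c(0.5, 0, -0.5), "+"))
labels <- apply(true_p, 1, function(pr) sample.int(K, 1, prob = pr))

# Average negative log-likelihood as a function of theta = c(scales, biases).
nll <- function(theta) {
  s <- theta[1:K]
  b <- theta[K + 1:K]
  z <- sweep(sweep(logits, 2, s, "*"), 2, b, "+")
  p_true <- softmax_rows(z)[cbind(seq_len(n), labels)]
  -mean(log(pmin(pmax(p_true, 1e-15), 1 - 1e-15)))
}

start <- c(rep(1, K), rep(0, K))  # assumed start: unit scales, zero biases
opt <- stats::optim(start, nll, method = "BFGS", control = list(maxit = 500))
scale_hat <- opt$par[1:K]
bias_hat <- opt$par[K + 1:K]
```

Since the biases are identified only up to a common additive constant, comparing bias_hat - mean(bias_hat) across fits is more meaningful than comparing the raw vector.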
A cal_vector_scaling object that also inherits from
cal_multiclass. Use predict() with new logits to obtain calibrated
probabilities.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. Proceedings of the 34th International Conference on Machine Learning.
set.seed(22)
logits <- matrix(rnorm(200 * 3), ncol = 3)
labels <- max.col(logits)
fit <- cal_vector_scaling(logits, labels)
head(predict(fit, logits))
ece() returns the empirical weighted average gap between mean confidence
and empirical event frequency across equal-width probability bins. It is zero
when confidence and accuracy match in every non-empty bin of the chosen
partition.
ece(p, y, bins = 10, type = c("classwise", "confidence"))
p |
Predicted probabilities. A numeric vector in |
y |
Outcome labels. A vector coded as |
bins |
Number of equal-width bins on |
type |
Multiclass aggregation, either |
For binary problems p is a probability vector. For multiclass problems p
is a probability matrix with one column per class and type selects the
multiclass definition. The "classwise" form averages the binary ECE over
the one-vs-rest columns, also known as the static calibration error. The
"confidence" form applies the binary ECE to the top-label confidence and
whether the predicted class is correct, which is the definition used by Guo
et al. (2017).
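For binary calibration, the interval [0, 1] is split into $B$
equal-width bins. The package uses left-closed bins,

$$I_b = \left[\tfrac{b - 1}{B},\, \tfrac{b}{B}\right)$$

for $b = 1, \dots, B - 1$, and

$$I_B = \left[\tfrac{B - 1}{B},\, 1\right]$$

for the last bin. Let $S_b = \{i : p_i \in I_b\}$ and
$n_b = |S_b|$. For each non-empty bin,

$$\bar{p}_b = \frac{1}{n_b} \sum_{i \in S_b} p_i$$

and

$$\bar{y}_b = \frac{1}{n_b} \sum_{i \in S_b} y_i.$$

The returned empirical ECE is

$$\mathrm{ECE} = \sum_{b:\, n_b > 0} \frac{n_b}{n} \left| \bar{y}_b - \bar{p}_b \right|.$$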
Empty bins have zero weight. The estimate depends on bins; changing the
number of bins changes the empirical partition and can change the value. A
value of zero means equality of sample bin means for this partition, not full
population calibration.
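For a probability matrix with entries $p_{ik}$ and $K$ classes, type = "classwise" computes the binary ECE for
each one-vs-rest column $p_{\cdot k}$ against the indicator $y_{ik} = \mathbf{1}\{y_i = k\}$
and returns their
arithmetic mean,
$$\widehat{\mathrm{ECE}}_{\mathrm{cw}} = \frac{1}{K} \sum_{k=1}^{K} \widehat{\mathrm{ECE}}(p_{\cdot k}, y_{\cdot k}).$$
type = "confidence" uses the top-label rule
$\hat{y}_i = \arg\max_k p_{ik}$,
the confidence $r_i = \max_k p_{ik}$, and the
correctness indicator
$c_i = \mathbf{1}\{\hat{y}_i = y_i\}$, then
applies the binary definition to $(r_i, c_i)$:
$$\widehat{\mathrm{ECE}}_{\mathrm{conf}} = \widehat{\mathrm{ECE}}(r, c).$$
For matrix inputs, column $k$ corresponds to integer class code $k$;
if y is a factor, column $k$ corresponds to levels(y)[k].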
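The two multiclass reductions can be illustrated with base R on a toy matrix. The data and variable names below are made up for illustration. The sketch only builds the binary inputs that each form passes to the binary definition, and it assumes ties in the top-label rule go to the first column.

```r
# Toy probability matrix: rows are observations, columns are classes 1..3.
prob <- matrix(c(0.7, 0.2, 0.1,
                 0.3, 0.3, 0.4,
                 0.2, 0.5, 0.3), ncol = 3, byrow = TRUE)
y <- c(1L, 2L, 2L)
n <- nrow(prob)

# Classwise view: column k is paired with the indicator 1{y_i = k}.
onehot <- outer(y, seq_len(ncol(prob)), "==") * 1

# Confidence view: top-label class, its probability, and whether it is correct.
pred <- max.col(prob, ties.method = "first")
confidence <- prob[cbind(seq_len(n), pred)]
correct <- as.integer(pred == y)
```

In this toy example the second observation is predicted as class 3 with confidence 0.4 but belongs to class 2, so its correctness indicator is 0.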
Here "calibrated" refers to the output of a fitted calibration map. It does
not imply population calibration. Binary population calibration can be stated
as $\Pr(Y = 1 \mid P = p) = p$ for the predicted probability random
variable $P$. For top-label confidence $R$, the analogous condition
is $\Pr(\hat{Y} = Y \mid R = r) = r$.
A single numeric value.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. Proceedings of the 34th International Conference on Machine Learning.

    predictions <- data.frame(
      p = c(0.10, 0.20, 0.80, 0.90),
      y = c(0, 0, 1, 1)
    )
    predictions |>
      dplyr::summarise(ece = ece(p, y, bins = 2))

    # Multiclass classwise ECE from a probability matrix.
    set.seed(30)
    prob <- matrix(stats::runif(150 * 3), ncol = 3)
    prob <- prob / rowSums(prob)
    labels <- max.col(prob)
    ece(prob, labels, bins = 10, type = "classwise")

inv_logit() maps finite real values to probabilities. Mathematically the
range is (0, 1), although floating-point results can round to 0 or 1
for extreme finite inputs. It is used by temperature scaling and by the
parametric calibrators fitted with logistic regression.

    inv_logit(x)

| Argument | Description |
|---|---|
| x | Numeric vector on the logit scale. |
The inverse logit, also called the logistic function, is
$$\sigma(x) = \frac{1}{1 + \exp(-x)}.$$
It maps real-valued scores to probabilities, is monotone increasing, and
satisfies $\sigma(0) = 1/2$. The implementation uses
stats::plogis(), which evaluates the same transformation with stable
numerical handling for large positive or negative inputs. The implementation
accepts finite numeric inputs only; infinite values are rejected even though
the mathematical limits of the logistic function are defined. The returned
vector has the same length as x.
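The benefit of the stable evaluation can be seen by comparing stats::plogis() with a naive formula on extreme inputs. This is a small base R illustration with made-up inputs; it does not call the package.

```r
x <- c(-800, 0, 800)

# Naive formula: exp(800) overflows to Inf, so the last entry is Inf / Inf = NaN.
naive <- exp(x) / (1 + exp(x))

# stats::plogis() evaluates the same function without producing NaN.
# For these extreme finite inputs the result rounds to exactly 0 and 1.
stable <- stats::plogis(x)
```

The rounding to exactly 0 and 1 is the floating-point behaviour mentioned in the description: the mathematical range is (0, 1), but the nearest representable value at these inputs is the boundary.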
A numeric vector of probabilities with the same length as x.

    scores <- data.frame(logit_score = c(-2, -1, 0, 1, 2)) |>
      dplyr::mutate(probability = inv_logit(logit_score))
    scores

logit() maps probabilities from (0, 1) to the real line. Inputs must lie
in [0, 1]; values outside this probability interval are rejected. Valid
probabilities below eps and above 1 - eps are clipped before the
transformation, because the mathematical logit is infinite at the boundary.

    logit(p, eps = .Machine$double.eps)

| Argument | Description |
|---|---|
| p | Numeric vector of probabilities in [0, 1]. |
| eps | Positive clipping constant in |
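For a probability $p \in (0, 1)$, the logit is
$$\operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right).$$
The transformation is monotone increasing and maps probabilities below
$0.5$ to negative values, $0.5$ to zero, and probabilities above
$0.5$ to positive values. Because the expression is not finite at
$p = 0$ or $p = 1$, the implementation first computes
$$\tilde{p} = \min\{\max(p, \varepsilon),\ 1 - \varepsilon\},$$
where $\varepsilon$ is eps, and then returns
$\operatorname{logit}(\tilde{p})$. The returned vector has the same length as
p.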
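The clip-then-transform step can be sketched in base R as follows. The name logit_sketch() is hypothetical and the body is an assumption about how the described behaviour could be written, not the package source.

```r
# Hypothetical helper (not part of the package): clip to [eps, 1 - eps],
# then apply log(p / (1 - p)).
logit_sketch <- function(p, eps = .Machine$double.eps) {
  stopifnot(is.numeric(p), all(p >= 0 & p <= 1))
  p_clip <- pmin(pmax(p, eps), 1 - eps)
  log(p_clip / (1 - p_clip))
}

# Boundary probabilities give large but finite values instead of -Inf and Inf.
out <- logit_sketch(c(0, 0.5, 1))
```

With the default eps, the two boundary inputs map to finite values of equal magnitude and opposite sign, while 0.5 maps to zero.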
A numeric vector on the logit scale with the same length as p.

    probabilities <- data.frame(p = c(0.05, 0.25, 0.5, 0.75, 0.95)) |>
      dplyr::mutate(
        logit_p = logit(p),
        recovered = inv_logit(logit_p)
      )
    probabilities

mce() returns the largest empirical absolute gap between mean confidence
and empirical event frequency among non-empty equal-width bins. For
multiclass inputs the "classwise" form returns the largest binary MCE
across the one-vs-rest columns and the "confidence" form uses the
top-label confidence.

    mce(p, y, bins = 10, type = c("classwise", "confidence"))

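| Argument | Description |
|---|---|
| p | Predicted probabilities. A numeric vector in [0, 1] for binary problems, or a probability matrix with one column per class for multiclass problems. |
| y | Outcome labels. A vector coded as |
| bins | Number of equal-width bins on [0, 1]. |
| type | Multiclass aggregation, either "classwise" or "confidence". |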
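Using the same bin notation and endpoint convention as ece(), the binary
empirical maximum calibration error is
$$\widehat{\mathrm{MCE}} = \max_{m:\ n_m > 0} \left| \operatorname{acc}(B_m) - \operatorname{conf}(B_m) \right|,$$
where $\operatorname{conf}(B_m)$ and $\operatorname{acc}(B_m)$ are the mean predicted probability and the observed event frequency in bin $B_m$, and $n_m$ is the number of observations in that bin.
Empty bins are ignored. For a multiclass probability matrix with $K$ classes,
type = "classwise" returns the maximum of the one-vs-rest binary MCE values
across classes,
$$\widehat{\mathrm{MCE}}_{\mathrm{cw}} = \max_{k = 1, \dots, K} \widehat{\mathrm{MCE}}(p_{\cdot k}, y_{\cdot k}),$$
where $p_{\cdot k}$ is column $k$ of the matrix and $y_{ik} = \mathbf{1}\{y_i = k\}$.
type = "confidence" returns
$$\widehat{\mathrm{MCE}}_{\mathrm{conf}} = \widehat{\mathrm{MCE}}(r, c),$$
using the top-label confidence and correctness variables defined in ece().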
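A base R sketch of the binary maximum gap follows. The helper name mce_sketch() is hypothetical, the data are made up, and the code assumes a numeric 0/1 outcome vector; it is not the package's implementation.

```r
# Hypothetical helper (not part of the package): largest absolute gap between
# event frequency and mean confidence over non-empty equal-width bins.
mce_sketch <- function(p, y, bins = 10) {
  stopifnot(length(p) == length(y), all(p >= 0 & p <= 1))
  idx <- pmin(floor(p * bins) + 1, bins)  # left-closed bins, p == 1 in last bin
  gap <- abs(tapply(y, idx, mean) - tapply(p, idx, mean))
  max(gap)
}

p <- c(0.10, 0.20, 0.30, 0.90)
y <- c(0, 0, 1, 1)

# Bin [0, 0.5): mean confidence 0.2, frequency 1/3, gap about 0.133.
# Bin [0.5, 1]: mean confidence 0.9, frequency 1, gap 0.1.
mce_value <- mce_sketch(p, y, bins = 2)
```

Here the result is driven entirely by the first bin; the bin sizes play no role, which is the difference from the sample-size weighting used by ECE.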
A single numeric value.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. Proceedings of the 34th International Conference on Machine Learning.

    predictions <- data.frame(
      p = c(0.10, 0.20, 0.80, 0.90),
      y = c(0, 0, 1, 1)
    )
    predictions |>
      dplyr::summarise(mce = mce(p, y, bins = 2))

mmce() is a binning-free empirical calibration statistic built from a
kernel mean embedding of the calibration error. Unlike ece(), it does not
partition the probability space into bins, so it avoids sensitivity to the
number and placement of bins. It still depends on the kernel and bandwidth.
The returned value is an empirical kernel statistic, not a population
calibration parameter by itself.

    mmce(p, y, bandwidth = 0.2)

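| Argument | Description |
|---|---|
| p | Predicted probabilities. A numeric vector in [0, 1] for binary problems, or a probability matrix with one column per class for multiclass problems. |
| y | Outcome labels. A vector coded as |
| bandwidth | Positive finite scalar bandwidth of the Laplacian kernel. |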
For a binary input the residual compares the event indicator y with the
predicted event probability p. For a multiclass probability matrix the
confidence is the top-label probability and correctness indicates whether the
predicted class is right. For multiclass inputs, mmce() implements only
this top-label confidence form; there is no classwise type argument. The
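statistic uses a Laplacian kernel
$k(r, r') = \exp(-|r - r'| / h)$. The computation builds an
observation-by-observation kernel matrix, so both time and memory scale as
$O(n^2)$.
Let $r_i$ be the scalar probability assigned to observation $i$ and
$c_i$ the corresponding binary target. In the binary case,
$r_i = p_i$ and $c_i = y_i$. In the multiclass case, ties are broken
by the first class,
$\hat{y}_i = \min\{k : p_{ik} = \max_j p_{ij}\}$,
$r_i = p_{i \hat{y}_i}$, and
$c_i = \mathbf{1}\{\hat{y}_i = y_i\}$.
The residual used by the statistic is
$$e_i = c_i - r_i.$$
With the Laplacian kernel
$$k(r_i, r_j) = \exp\left(-\frac{|r_i - r_j|}{h}\right),$$
where $h$ is bandwidth, the returned value is the V-statistic plug-in
estimate with diagonal terms,
$$\widehat{\mathrm{MMCE}} = \sqrt{\max\left(0,\ \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} e_i\, e_j\, k(r_i, r_j)\right)}.$$
The square-root argument is truncated at zero after numerical computation to avoid negative values caused only by floating-point error, so the returned value is nonnegative.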
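A minimal base R sketch of a V-statistic of this kind for the binary case is shown below. The name mmce_sketch() is hypothetical, and the kernel scaling exp(-|r - r'| / bandwidth) is an assumption made for illustration; the package may parameterise the Laplacian kernel differently.

```r
# Hypothetical helper (not part of the package): binary MMCE-style V-statistic.
# Assumed kernel: k(r, r') = exp(-abs(r - r') / bandwidth).
mmce_sketch <- function(p, y, bandwidth = 0.2) {
  stopifnot(length(p) == length(y), bandwidth > 0, is.finite(bandwidth))
  e <- y - p                                  # residuals: target minus probability
  K <- exp(-abs(outer(p, p, "-")) / bandwidth)  # n x n kernel matrix, O(n^2) memory
  v <- sum(outer(e, e) * K) / length(p)^2     # V-statistic, diagonal terms included
  sqrt(max(v, 0))                             # guard against tiny negative round-off
}

# Two identical predictions of 0.5 that both turn out to be events:
# residuals are 0.5, all kernel entries are 1, so the statistic is 0.5.
half <- mmce_sketch(c(0.5, 0.5), c(1, 1))

# Predictions that equal the outcomes have zero residuals, so the statistic is 0.
zero <- mmce_sketch(c(0, 1), c(0, 1))
```

The explicit kernel matrix makes the quadratic cost visible: for large inputs the outer() calls dominate both time and memory.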
A single numeric value.
Kumar, A., Sarawagi, S., & Jain, U. (2018). Trainable calibration measures for neural networks from kernel mean embeddings. Proceedings of the 35th International Conference on Machine Learning.

    set.seed(31)
    p <- stats::runif(200)
    y <- rbinom(200, 1, p)
    mmce(p, y)

reliability_diagram() returns a ggplot2 object comparing mean predicted
confidence with the observed event frequency in equal-width probability bins.
By default, points are sized by the number of observations in each non-empty
bin and the subtitle reports the ECE computed with the same bins.

    reliability_diagram(
      p,
      y,
      bins = 10,
      show_ece = TRUE,
      show_counts = TRUE,
      type = c("classwise", "confidence")
    )

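| Argument | Description |
|---|---|
| p | Predicted probabilities. A numeric vector in [0, 1] for binary problems, or a probability matrix with one column per class for multiclass problems. |
| y | Outcome labels. A vector coded as |
| bins | Number of equal-width bins on [0, 1]. |
| show_ece | Logical. If TRUE, the subtitle reports the ECE computed with the same bins. |
| show_counts | Logical. If TRUE, points are sized by the number of observations in each non-empty bin. |
| type | Multiclass layout, either "classwise" or "confidence". |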
For a probability matrix the function builds a multiclass diagram. The
"classwise" form draws one panel per class from the one-vs-rest view. The
"confidence" form draws a single panel from the top-label confidence and
whether the predicted class is correct.
The diagram is a visual version of the binned summaries used by ece(). For
binary inputs, the package uses the same left-closed equal-width bins as
ece(), with the last bin closed on the right. For each non-empty bin
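$B_m$, the x-coordinate is the mean predicted probability,
$$\operatorname{conf}(B_m) = \frac{1}{n_m} \sum_{i \in I_m} p_i,$$
and the y-coordinate is the observed event frequency,
$$\operatorname{acc}(B_m) = \frac{1}{n_m} \sum_{i \in I_m} y_i,$$
where $I_m$ is the set of observations whose predicted probability falls in $B_m$ and $n_m = |I_m|$.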
Points near the diagonal line have similar average confidence and empirical frequency within the bin. Points below the diagonal indicate over-confident predictions in that bin, and points above the diagonal indicate under-confident predictions. Empty bins are omitted from the plotted data. The diagonal reference line is the set where the bin mean predicted probability equals the empirical event frequency.
For multiclass inputs, type = "classwise" builds these summaries separately
for each one-vs-rest class and displays them in facets. type = "confidence"
replaces $p_i$ by the top-label probability and $y_i$ by the
indicator that the top-label prediction is correct. Ties in the top-label
rule are broken by the first column, matching max.col(..., ties.method = "first"). When show_ece = TRUE, the subtitle reports
ece(p, y, bins = bins) for binary inputs and
ece(p, y, bins = bins, type = type) for multiclass inputs.
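The per-bin coordinates behind a binary diagram can be reproduced with base R. The sketch below uses simulated data and made-up column names, and it only builds the summary table; it is not the package's plotting code.

```r
set.seed(6)
p <- runif(120)
y <- rbinom(120, 1, p)
bins <- 8

# Left-closed equal-width bins, with p == 1 assigned to the last bin.
idx <- pmin(floor(p * bins) + 1, bins)

# One row per non-empty bin: x-coordinate, y-coordinate, and point size.
bin_summary <- data.frame(
  bin = sort(unique(idx)),
  mean_p = as.vector(tapply(p, idx, mean)),
  frequency = as.vector(tapply(y, idx, mean)),
  n = as.vector(tapply(p, idx, length))
)

# Positive gap: mean confidence exceeds the observed frequency, so the point
# lies below the diagonal (over-confident bin).
bin_summary$gap <- bin_summary$mean_p - bin_summary$frequency
```

Plotting frequency against mean_p with a diagonal reference line, and sizing points by n, gives the same kind of picture as the diagram described above.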
A ggplot object.
Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. Proceedings of the 22nd International Conference on Machine Learning.

    set.seed(6)
    predictions <- data.frame(raw_p = stats::runif(120)) |>
      dplyr::mutate(y = rbinom(dplyr::n(), 1, raw_p))
    reliability_diagram(predictions$raw_p, predictions$y, bins = 8)

    # Multiclass reliability diagram with one panel per class.
    set.seed(60)
    prob <- matrix(stats::runif(150 * 3), ncol = 3)
    prob <- prob / rowSums(prob)
    labels <- max.col(prob)
    reliability_diagram(prob, labels, bins = 8, type = "classwise")
