Calibrating Regression Uncertainty via σ Scaling
- applies estimation of predictive uncertainty via variational Bayesian inference with Monte Carlo dropout to regression tasks and shows why predictive uncertainty is systematically underestimated
- suggests recalibrating with $\sigma$ scaling using a single scalar value
Introduction
- aim to estimate a continuous target value $y\in \IR^d$ given an input image $x$
- Bayesian neural networks (BNNs) and their approximations provide mathematical tools for reasoning about uncertainty
- in general, predictive uncertainty can be split into two types:
  - aleatoric uncertainty (noise inherent in the data)
  - epistemic uncertainty (uncertainty about the model parameters, reducible with more data)
- a well-accepted approach to quantify epistemic uncertainty is variational inference with Monte Carlo dropout, where dropout is used at test time to sample from the approximate posterior
- however, uncertainty obtained from deep BNNs tends to be miscalibrated, i.e. it does not match the model's actual error
instead of the exact model error, one might only care about the ranking of uncertainties; is there a calibration method for ranking?
Platt scaling: calibration of uncertainty in regression
- given a pre-trained, miscalibrated model $H$, an auxiliary model $R: [0, 1]^d \rightarrow [0, 1]^d$ is trained such that $R\circ H$ is a calibrated regressor
- this was applied to bounding box regression
- an auxiliary model with enough capacity will always be able to recalibrate, even if the predicted uncertainty is completely uncorrelated with the real uncertainty
- calibration via $R$ is only possible if enough i.i.d. data is available
- in medical imaging, large data sets are usually hard to obtain, which can cause $R$ to overfit the calibration set
the main contributions of the paper:
- analyze and provide theoretical background why deep models for regression are miscalibrated with regard to predictive uncertainty
- suggest using $\sigma$ scaling in a separate calibration phase to tackle the underestimation of uncertainty
- perform extensive experiments on four different datasets
Methods
2.1 Conditional Log-likelihood for Regression
revisit regression under the MAP framework to derive direct estimation of heteroscedastic aleatoric uncertainty
the goal of the regression model is to predict a target value $y$ given some new input $x$ and a training set $\cD$ of $m$ inputs $\{x_1,\ldots, x_m\}$ and their corresponding (observed) target values $\{y_1,\ldots, y_m\}$
assume that $y$ has a Gaussian distribution $N(y; \hat y(x), \hat\sigma^2(x))$ with mean $\hat y(x)$ and variance $\hat\sigma^2(x)$
a neural network with parameters $\theta$ outputs these values for a given input:
\[f_\theta(x) = [\hat y(x), \hat\sigma^2(x)], \quad \hat y\in \IR^d,\ \hat\sigma^2 \ge 0\]
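a minimal sketch (my own illustration, not the paper's code) of such a two-headed network: the variance head goes through a softplus to keep $\hat\sigma^2 \ge 0$, and the dropout layer is what later enables MC dropout

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanVarianceNet(nn.Module):
    """Backbone with two heads: predictive mean y_hat and variance sigma^2 (illustrative only)."""
    def __init__(self, in_dim: int, hidden: int = 128, out_dim: int = 1):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Dropout(p=0.2),  # kept active at test time for MC dropout later
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, out_dim)
        self.var_head = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h = self.backbone(x)
        y_hat = self.mean_head(h)
        var = F.softplus(self.var_head(h)) + 1e-6  # enforce sigma^2 > 0
        return y_hat, var
```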
is the bias between $y$ and $\hat y(x)$ not considered?
with $m$ i.i.d. random samples, the conditional log-likelihood is given by
\[\log p(\cD \mid \theta) = -\sum_{i=1}^{m}\left(\frac{(y_i - \hat y_\theta(x_i))^2}{2\hat\sigma^2_\theta(x_i)} + \frac{1}{2}\log \hat\sigma^2_\theta(x_i)\right) + \text{const} \quad (\text{written here for } d=1)\]
maximizing it is equivalent to minimizing the negative log-likelihood
in this case, $\hat\sigma_\theta$ captures the uncertainty that is inherent in the data (aleatoric uncertainty)
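written out as a batch loss (up to an additive constant), the negative log-likelihood could look like the sketch below; the function name and the small eps are my own choices, and recent PyTorch versions also ship `torch.nn.GaussianNLLLoss` for the same objective

```python
import torch

def gaussian_nll(y_hat, var, y, eps=1e-6):
    """Heteroscedastic Gaussian NLL: 0.5 * ((y - y_hat)^2 / sigma^2 + log sigma^2), averaged."""
    var = var.clamp_min(eps)  # guard against numerically zero variance
    return 0.5 * ((y - y_hat) ** 2 / var + torch.log(var)).mean()

# usage (with the two-headed network from above):
# y_hat, var = model(x); loss = gaussian_nll(y_hat, var, y); loss.backward()
```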
2.2 Biased estimation of $\sigma$
ignoring the dependence through $\theta$, the solution decouples the estimation of $\hat y$ and $\hat\sigma$; the resulting $\hat\sigma^2$ is fitted to the training residuals and is therefore biased, ending up systematically too small on unseen data (see the short derivation below)
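to make this explicit (my own short derivation from the NLL above): holding $\hat y$ fixed and setting the derivative of the per-sample negative log-likelihood with respect to $\hat\sigma^2$ to zero gives
\[\frac{\partial}{\partial \hat\sigma^2}\left(\frac{(y_i - \hat y(x_i))^2}{2\hat\sigma^2} + \frac{1}{2}\log \hat\sigma^2\right) = 0 \;\Rightarrow\; \hat\sigma^2(x_i) = (y_i - \hat y(x_i))^2,\]
so the optimal $\hat\sigma^2$ is exactly the squared training residual, which a flexible network overfits; hence the systematic underestimation on unseen data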
2.3 $\sigma$ Scaling for Aleatoric Uncertainty
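no notes here, so a hedged sketch of what $\sigma$ scaling amounts to: a single scalar $s$ rescales the predicted standard deviation, $\hat\sigma \rightarrow s\,\hat\sigma$, with $s$ chosen on a held-out calibration set by minimizing the Gaussian NLL while $\hat y$ and $\hat\sigma$ stay fixed; for this objective $s$ even has a closed form (my derivation, names below are mine and not the paper's code)

```python
import torch

def fit_sigma_scale(y_hat, var, y):
    """NLL-optimal scalar s for sigma scaling.

    Minimizing 0.5 * ((y - y_hat)^2 / (s^2 var) + log(s^2 var)) over s on a
    calibration set gives the stationary point s^2 = mean((y - y_hat)^2 / var).
    """
    with torch.no_grad():
        s2 = ((y - y_hat) ** 2 / var).mean()
    return s2.sqrt()

# usage on a held-out calibration split (hypothetical variable names):
# s = fit_sigma_scale(y_hat_cal, var_cal, y_cal)
# calibrated_var_test = (s ** 2) * var_test
```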
2.4 Well-Calibrated Estimation of Predictive Uncertainty
so far we only have the MAP point estimate for $\theta$, which does not consider uncertainty in the parameters
to quantify both aleatoric and epistemic uncertainty, extend $f_\theta$ into a fully Bayesian model under the variational inference framework with Monte Carlo dropout.
in MC dropout, the model $f_{\tilde\theta}$ is trained with dropout, and dropout is kept active at test time; $N$ stochastic forward passes sample from the approximate posterior $\tilde\theta\sim q(\theta)$ (see the sketch below)
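a hedged sketch of the sampling step (assuming the two-headed model from above): keep dropout active at test time, run $N$ forward passes, and combine the samples into a predictive mean and a total variance $\hat\Sigma^2$, i.e. the mean aleatoric variance plus the variance of the sampled means (the epistemic part)

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples: int = 25):
    """N stochastic forward passes with dropout enabled."""
    model.train()  # keeps dropout on; in practice only dropout layers should be in train mode
    means, variances = [], []
    for _ in range(n_samples):
        y_hat, var = model(x)
        means.append(y_hat)
        variances.append(var)
    means = torch.stack(means)          # [N, batch, d]
    variances = torch.stack(variances)  # [N, batch, d]
    pred_mean = means.mean(dim=0)
    # total predictive variance: aleatoric (mean of sigma^2) + epistemic (variance of means)
    pred_var = variances.mean(dim=0) + means.var(dim=0, unbiased=False)
    return pred_mean, pred_var
```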
use GBS instead for bootstrap?
apply $\sigma$ scaling to recalibrate the predictive uncertainty $\hat\Sigma^2$
this still allows a low squared error while reducing the underestimation of uncertainty
2.5 Expected Uncertainty Calibration Error for Regression
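the notes don't spell the metric out, so here is my reading of a binned uncertainty calibration error for regression: bin predictions by their predicted variance, then compare the mean squared error to the mean predicted variance per bin, weighted by bin size; the exact binning scheme and error/uncertainty measures below are assumptions

```python
import torch

def expected_uncertainty_calibration_error(y_hat, var, y, n_bins: int = 10):
    """UCE ~= sum_b (|B_b| / m) * | MSE(B_b) - mean predicted variance(B_b) | over variance bins."""
    err = ((y - y_hat) ** 2).flatten()
    var = var.flatten()
    edges = torch.linspace(var.min().item(), var.max().item(), n_bins + 1)
    bin_idx = torch.bucketize(var, edges[1:-1])  # assign each sample to one of n_bins bins
    uce, m = 0.0, var.numel()
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            uce += mask.sum().item() / m * abs(err[mask].mean().item() - var[mask].mean().item())
    return uce
```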
Experiments & Results
Discussion & Conclusion
- well-calibrated uncertainty from MC dropout is able to reliably detect a shift in the data distribution
- $\sigma$ scaling is simple to implement, does not change the predictive mean $\hat y$, and does not affect the model accuracy
- many factors (e.g., network capacity, weight decay, dropout configuration) influence the uncertainty and have not been discussed here