Likelihood Annealing
- deep learning approaches that allow uncertainty estimation for regression problems often converge slowly and yield poorly calibrated uncertainty estimates that cannot be used effectively for quantification
- the work presents Likelihood Annealing (LIKA), a fast calibrated uncertainty estimation method for regression tasks that consistently improves the convergence of deep regression models and yields calibrated uncertainty without any post hoc calibration phase
Introduction
various formulations provide uncertainty estimates alongside accurate predictions for deep neural networks:
- Bayesian approaches
- pseudo-ensembles
- quantile regression
revisit deep regression models trained via MLE, which assume a Gaussian distribution over the regression output and optimize the negative log-likelihood to estimate both the target and the uncertainty.
- they often converge slowly at the beginning of training due to a flat gradient landscape
- they may even risk gradient explosion caused by a steep gradient landscape when approaching the optimum
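for the heteroscedastic Gaussian likelihood (introduced formally below), the per-sample NLL and its gradients make both failure modes explicit (a standard derivation, not specific to this paper):

\[\mathcal{L}_i = \frac{1}{2}\log(2\pi\hat\sigma_i^2) + \frac{(y_i - \hat y_i)^2}{2\hat\sigma_i^2}, \quad \frac{\partial \mathcal{L}_i}{\partial \hat y_i} = \frac{\hat y_i - y_i}{\hat\sigma_i^2}, \quad \frac{\partial \mathcal{L}_i}{\partial \hat\sigma_i^2} = \frac{1}{2\hat\sigma_i^2} - \frac{(y_i - \hat y_i)^2}{2\hat\sigma_i^4}\]

early in training the network can inflate $\hat\sigma_i^2$ to explain large residuals, which attenuates the mean gradient (slow convergence); near the optimum, small $\hat\sigma_i^2$ makes the $1/\hat\sigma_i^2$ factor large (risk of gradient explosion)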
to reshape the aforementioned ill-posed gradient landscape, which causes slow convergence and poorly calibrated uncertainty, the paper proposes a novel Likelihood Annealing (LIKA) scheme for deep regression models; it alters the original gradients by formulating a temperature-dependent improper likelihood that is optimized during the learning phase
the proposed temperature-dependent likelihood brings crucial properties to regression uncertainty estimation:
- the multimodal distribution over the regression target ensures that at high residuals (between output and ground truth, typical of the initial learning phase), the gradients are much larger than under the standard unimodal Gaussian distribution, leading to faster convergence at the beginning of the learning phase
- annealing the learning rate over the course of training, along with the temperature, avoids gradient explosion towards the end of the learning phase, a problem with the standard heteroscedastic Gaussian likelihood, whose gradients are sharp at low errors (a possible schedule is sketched after this list)
- the temperature-dependent likelihood is constructed such that the predicted uncertainty is encouraged to be calibrated at every step, i.e., to stay close to the error between the prediction and the ground truth
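the paper's exact temperature schedule is not reproduced in these notes; as a minimal sketch, assuming an exponential decay from a high initial temperature to a small final one:

```python
def temperature(step: int, total_steps: int,
                tau_start: float = 10.0, tau_end: float = 1e-3) -> float:
    """Exponential annealing from tau_start to tau_end.
    The schedule and endpoint values are assumptions of this sketch,
    not necessarily those used in the paper."""
    frac = min(step / max(total_steps, 1), 1.0)
    return tau_start * (tau_end / tau_start) ** frac
```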
Related Work
- DNNs typically estimate inaccurate uncertainty due to their deterministic form, which is insufficient for characterizing confidence accurately; remedies include:
- Bayesian inference
- approximate inference
- model two terms, the predictive mean and variance, as outputs of the DNN to estimate uncertainty directly from the network's output
- estimate different quantiles for a given input
- conformal predictions
two types of uncertainties in deep learning
- Aleatoric: the uncertainty that arises from the inherent randomness in the data
- Epistemic: the uncertainty that arises due to a lack of knowledge or information about the data
calibrating inaccurate uncertainty post hoc is another way to obtain accurate uncertainty estimates
the estimated credible interval with confidence level $\alpha$ is calibrated if $\alpha\%$ of the ground-truth targets are covered by that interval.
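as an illustration of this definition, assuming Gaussian predictive distributions (`empirical_coverage` is a hypothetical helper, not from the paper):

```python
import numpy as np
from scipy.stats import norm

def empirical_coverage(y, y_hat, sigma, alpha=0.9):
    """Fraction of targets inside the central alpha-credible interval of
    the predicted Gaussian N(y_hat, sigma^2); calibrated if ~= alpha."""
    z = norm.ppf(0.5 + alpha / 2)  # interval half-width in standard deviations
    return float(np.mean(np.abs(y - y_hat) <= z * sigma))
```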
post-processing methods for regression calibration:
- introduce an auxiliary model to adjust the output of the pre-trained model based on Platt scaling; other methods use Gaussian processes or maximum mean discrepancy
- an auxiliary model with enough capacity will always be able to recalibrate, even if the predicted uncertainty is completely uncorrelated with the real uncertainty
Methodology: Likelihood Annealing
Likelihood Annealing (LIKA) belongs to the family of models designed to predict a distribution over the outputs; the model is trained via a loss function derived from MLE
Kendall & Gal (2017): relax the i.i.d. assumption and learn to model the heteroscedasticity as well
assume the residuals $\epsilon_i\sim \mathcal{N}(0, \hat\sigma_i^2)$; the likelihood is then a factored Gaussian distribution
\[P(\mathcal{D}\mid \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi \hat \sigma_i^2}}\exp\left(-\frac{\vert y_i - \hat y_i\vert^2}{2\hat\sigma_i^2}\right)\]
the DNN is modified to output both the prediction (i.e., the mean of the Gaussian) and the uncertainty estimate (i.e., the variance of the Gaussian), learned using the above equation, i.e., $\Psi(x_i, \theta) = \{\hat y_i, \hat\sigma_i^2\}$
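a minimal PyTorch sketch of this setup; the two-headed architecture is an assumption of the sketch, only the loss follows the equation above:

```python
import torch
import torch.nn as nn

class HeteroscedasticMLP(nn.Module):
    """Outputs both the predictive mean and variance: Psi(x, theta) = {y_hat, var_hat}."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)    # \hat{y}_i
        self.logvar_head = nn.Linear(hidden, 1)  # log \hat{sigma}_i^2, for positivity

    def forward(self, x):
        h = self.body(x)
        return self.mean_head(h), self.logvar_head(h).exp()

def gaussian_nll(y, y_hat, var):
    """Per-sample negative log of the factored Gaussian likelihood above."""
    return 0.5 * (torch.log(2 * torch.pi * var) + (y - y_hat) ** 2 / var)
```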
constructing a temperature-dependent improper likelihood
take the negative log of the improper likelihood
which can be rewritten as
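the improper-likelihood equations themselves are not reproduced in these notes; purely as a hypothetical stand-in (not the paper's formulation), one way to realize the annealing idea is to add temperature-weighted terms to the Gaussian NLL and decay $\tau$ with a schedule like the one sketched earlier:

```python
import torch

def annealed_loss(y, y_hat, var, tau: float):
    """Illustrative stand-in, NOT the paper's improper likelihood:
    (i)   the standard Gaussian NLL;
    (ii)  an MSE term with strong, well-behaved gradients early in
          training, weighted by the annealed temperature tau -> 0;
    (iii) a penalty pulling the predicted variance towards the squared
          error, encouraging calibration at every step."""
    sq_err = (y - y_hat) ** 2
    nll = 0.5 * (torch.log(2 * torch.pi * var) + sq_err / var)
    calib = (var - sq_err.detach()).abs()
    return (nll + tau * sq_err + calib).mean()
```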
Experiments
Evaluation Metrics
to measure the quality of uncertainty estimates $(\hat\sigma^2)$, compute
- the correlation coefficient (Corr. Coeff.) between uncertainty estimates $(\hat\sigma^2)$ and the error $(\vert \hat y - y\vert^2)$
- uncertainty calibration error (UCE) for regression tasks: the uncertainty output $\hat \sigma^2$ of a deep model is partitioned into $M$ bins
- a weighted average of the difference between the predictive error and uncertainty is used: $UCE = \sum_{m=1}^M \frac{\vert B_m\vert}{N}\vert err(B_m) - uncer(B_m)\vert$, where $err(B_m) = \frac{1}{\vert B_m\vert}\sum_{i\in B_m}\Vert \hat y_i - y_i\Vert^2$ and $uncer(B_m) = \frac{1}{\vert B_m\vert}\sum_{i\in B_m}\hat\sigma_i^2$ (a sketch of this computation appears after this list)
- UCE for the re-calibrated estimates
- expected calibration error
- sharpness
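a numpy sketch of the UCE computation above; equal-width binning over $\hat\sigma^2$ is an assumption of this sketch:

```python
import numpy as np

def uce(y, y_hat, var, n_bins: int = 10):
    """Uncertainty calibration error: weighted average over M = n_bins bins
    of |err(B_m) - uncer(B_m)|, with bins taken over predicted variance."""
    sq_err = (y - y_hat) ** 2
    edges = np.linspace(var.min(), var.max(), n_bins + 1)
    bin_ids = np.clip(np.digitize(var, edges[1:-1]), 0, n_bins - 1)
    total = 0.0
    for m in range(n_bins):
        mask = bin_ids == m
        if mask.any():
            # mask.mean() == |B_m| / N
            total += mask.mean() * abs(sq_err[mask].mean() - var[mask].mean())
    return float(total)
```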