Direct Epistemic Uncertainty Prediction
Epistemic uncertainty: a measure of the lack of knowledge of a learner which diminishes with more evidence
- while existing work focuses on using the variance of the Bayesian posterior due to parameter uncertainty as a measure of epistemic uncertainty, the paper argues that this does not capture the part of lack of knowledge induced by model misspecification
the paper
- discuss how the excess risk, i.e., the gap between the generalization error of a predictor and that of the Bayes predictor, is a sound measure of epistemic uncertainty which captures the effect of model misspecification
- propose a principled framework for directly estimating the excess risk by learning a secondary predictor for the generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability
- discuss the merits of the novel measure of epistemic uncertainty, and highlight how it differs from variance-based measures of epistemic uncertainty
- the framework Direct Epistemic Uncertainty Prediction (DEUP) is particularly interesting in interactive learning environments, where the learner is allowed to acquire novel examples in each round
- through a wide set of experiments, illustrate how existing methods in sequential model optimization can be improved with epistemic uncertainty estimates from DEUP, and how DEUP can be used to drive exploration in reinforcement learning; also evaluate the quality of uncertainty estimates from DEUP for probabilistic image classification and for predicting synergies of drug combinations
Introduction
- epistemic uncertainty (EU): a measure of lack of knowledge that an active learner should minimize
- aleatoric uncertainty: an intrinsic notion of randomness and is irreducible by nature.
- EU can be potentially reduced with additional information
EU estimation is a key ingredient in interactive decision making settings such as
- active learning
- sequential model optimization (SMO)
- reinforcement learning (RL)
What is uncertainty and how to quantify it?
- MC-Dropout and Deep Ensembles, both of which are approximate Bayesian methods, use the posterior predictive variance as a measure of uncertainty: if multiple neural nets that are all compatible with the data make different predictions at $x$, the discrepancy between these predictions is a strong indicator of epistemic uncertainty at $x$
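As a concrete illustration of this discrepancy-based view (a minimal sketch of mine, not code from the paper), one can train a handful of identically specified networks that differ only in their random initialization and use the variance of their predictions at $x$ as the uncertainty signal:

```python
# Minimal sketch (not from the paper): disagreement between ensemble members
# as a proxy for epistemic uncertainty. Model class and ensemble size are arbitrary.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(50, 1))
y_train = np.sin(3 * X_train[:, 0]) + 0.1 * rng.normal(size=50)

# Members differ only through random initialization (and SGD noise),
# in the spirit of Deep Ensembles.
ensemble = [
    MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=seed).fit(X_train, y_train)
    for seed in range(5)
]

X_test = np.linspace(-2, 2, 9).reshape(-1, 1)   # includes points outside the training range
preds = np.stack([m.predict(X_test) for m in ensemble])  # shape: (n_members, n_test)

# High variance across members = high disagreement = high (discrepancy-based) EU.
epistemic_proxy = preds.var(axis=0)
print(epistemic_proxy)
```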
Pitfalls of using the Bayesian posterior to estimate uncertainty
epistemic uncertainty in a predictive model, seen as a measure of lack of knowledge, consists of
- approximation uncertainty: due to the finite size of the training dataset
- model uncertainty: due to model misspecification
approximate Bayesian methods often suffer from model misspecification, e.g., due to the implicit bias induced by SGD and finite computational time
contributions:
- analyze the pitfalls of using discrepancy-based measures of EU, given that they miss out on model complexity, which is defined as a component of the excess risk, i.e., the gap between the risk (or out-of-sample loss) of the predictor at point $x$ and that of the Bayes predictor at $x$ (the one with the lowest expected loss, that no amount of additional data could reduce)
- consider the fundamental notion of epistemic uncertainty as lack of knowledge, and based on this, propose to estimate the excess risk as a measure of EU. Propose DEUP (Direct Epistemic Uncertainty Prediction), where one trains a secondary model, called the error predictor, with an approximate objective and approximate data, to estimate the point-wise generalization error (the risk), and then subtracts an estimate of aleatoric uncertainty if available, or otherwise uses the risk estimate as an upper bound on EU.
accounting for bias when measuring EU is particularly useful to an interactive learner whose effective capacity depends on the training data, especially in regions of the input space one cares about
- DEUP is agnostic to the type of predictors used, and in interactive settings, it is agnostic to the particular search method still needed to select points with high EU in order to propose new candidates for active learning, SMO, or exploration in RL
a unique advantage of DEUP, compared with discrepancy-based measures of EU, is that it can be explicitly trained to estimate, and be calibrated for, the uncertainty of examples which may come from a distribution slightly different from that of most of the training examples.
- such distribution shifts (referred to as feedback covariate shift) arise naturally in interactive contexts such as RL, because the learner explores new areas of the input space
- in these non-stationary settings, one typically wants to retrain the main predictor as one acquires new training data, not just because more training data is generally better, but also to better track the changing distribution and generalize to yet unseen but upcoming OOD inputs
- a large error initially made at a point $x$ before $x$ is incorporated in the training set will typically be greatly reduced after updating the main predictor with $(x, y)$.
- use additional features as input to the error predictor, that are informative of both the input point and the dataset used to obtain the current predictor
2. Excess Risk, Epistemic Uncertainty, and Model Misspecification
2.1 Notations and Background
- outputs $y\in \cY$
- inputs $x\in \cX\subset \IR^d$
- function $f: \cX\rightarrow \cA$
- training dataset: $z^N = (z_1,\ldots, z_N)\in \cZ^N$, where $\cZ = \cX\times \cY$ and $z_i = (x_i, y_i)$
- unknown ground-truth generative model: $P(X, Y) = P(X)P(Y\mid X)$
- set $\cA$: the action space
- equal to $\cY$ for regression
- equal to $\Delta(\cY)$, the set of probability distributions on $\cY$ for classification problems
- loss function: $l: \cY\times \cA\rightarrow\IR^+$
- $l(y, a) = \Vert y-a\Vert^2$ for regression
- $l(y, a) = -\log a(y)$ for classification
the point-wise risk (or expected loss) of a predictor $f$ at $x\in\cX$ is defined as
\[R(f, x) = \bbE_{P(Y\mid X=x)}[l(Y, f(x))]\]
the risk (total expected loss) of a predictor is defined as the marginalization over $x$:
\[R(f) = \bbE_{P(X, Y)}[l(Y, f(X))]\]
given a hypothesis space $\cH$, a subset of $\cF(\cX, \cA)$ (the set of functions $f: \cX\rightarrow\cA$), the goal of any learning algorithm (or learner) $\cL$ is to find a predictive function $h^*\in \cH$ with the lowest possible risk
\[h^* = \argmin_{h\in\cH} R(h)\]
in practice, the learner returns $\cL(z^N) = h_{z^N}\in \cH$ minimizing an approximation of the risk, called the empirical risk
\[R_{z^N}(h) = \frac{1}{N}\sum_{i=1}^N l(y_i, h(x_i))\,, \qquad \text{where } z_i = (x_i, y_i)\]
\[h_{z^N} = \argmin_{h\in \cH} R_{z^N}(h)\]
the dataset $z^N$ need not be used solely to define the empirical risk, e.g., the learner can use a subset as a validation set to tune its hyperparameters
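For concreteness, a tiny sketch (mine, with an assumed toy generative model) contrasting the empirical risk $R_{z^N}(h)$ with a Monte Carlo estimate of the point-wise risk $R(h, x)$ under the squared loss:

```python
# Sketch (not from the paper): empirical risk vs. Monte Carlo point-wise risk
# for squared loss, on a toy generative model y = sin(3x) + noise.
import numpy as np

rng = np.random.default_rng(0)

def sample_y(x, n=1):
    # samples from P(Y | X = x) for the toy model
    return np.sin(3 * x) + 0.3 * rng.normal(size=n)

# a deliberately biased predictor h (any fitted model could stand in here)
h = lambda x: 0.5 * x

# empirical risk R_{z^N}(h) on a finite training set z^N
X = rng.uniform(-1, 1, size=200)
Y = sample_y(X, n=200)
empirical_risk = np.mean((Y - h(X)) ** 2)

# Monte Carlo estimate of the point-wise risk R(h, x) at a single x
x0 = 0.8
pointwise_risk = np.mean((sample_y(x0, n=10_000) - h(x0)) ** 2)

print(empirical_risk, pointwise_risk)
```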
2.2 Sources of lack of knowledge
- Bayes predictor: $f^*(x) = \argmin_{a\in \cA}\bbE_{P(Y\mid X=x)}[l(Y, a)]$.
- $R(f^*, x) > 0$ indicates an irreducible risk due to the inherent randomness of $P(Y\mid X=x)$.
- $R(f^*, x)$ is thus a measure of aleatoric uncertainty at $x$
- there may be more than one Bayes predictor, but they all have the same point-wise risk, denoted $A(x) = R(f^*, x)$
- minimizing the risk over $\cH$ rather than $\cF(\cX, \cA)$ induces a discrepancy between $h^*$ (the optimal predictor in $\cH$) and $f^*$ (the Bayes predictor), usually referred to as model uncertainty. this can be seen as a form of bias, as the optimization is limited to functions in $\cH$
- the discrepancy between $h_{z^N}$ and $h^*$ is called the approximation uncertainty, due both to the finite training set and finite computational resources for training
excess risk: \(ER(f, x) = R(f, x) - A(x)\)
an estimator of the excess risk of $f$ at $x$ can be used as a measure of epistemic uncertainty
using an estimator of ER as a measure of EU is the main idea behind DEUP
example (regression with squared loss): if \(P(Y\mid X=x) = N(Y; \mu(x), \sigma^2(x))\), the Bayes predictor is $f^* = \mu$, and
\[R(f, x) = \bbE_{P(Y\mid X=x)}[(Y-f(x))^2] = \bbE[Y^2] - 2f(x)\bbE[Y] + f^2(x) = \sigma^2(x) + (f(x) - \mu(x))^2\]
so \(ER(f, x) = (f(x) - \mu(x))^2\)
for classification with the log-loss, with $\mu(x)$ denoting the ground-truth conditional $P(Y\mid X=x)$:
\[R(f, x) = \bbH(\mu(x), f(x))\,, \qquad ER(f, x) = D_{KL}(\mu(x)\,\Vert\, f(x))\]
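A quick Monte Carlo sanity check of the regression identity above (my own sketch), confirming that subtracting $A(x) = \sigma^2(x)$ from an estimate of $R(f, x)$ recovers the squared bias $(f(x) - \mu(x))^2$:

```python
# Sketch: Monte Carlo check that R(f, x) = sigma^2(x) + (f(x) - mu(x))^2
# under Gaussian noise, so the excess risk is the squared bias (f(x) - mu(x))^2.
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 0.4        # mu(x), sigma(x) at some fixed x
f_x = 2.0                   # prediction of an arbitrary (non-Bayes) predictor f at x

y = rng.normal(mu, sigma, size=1_000_000)   # samples from P(Y | X = x)
risk_mc = np.mean((y - f_x) ** 2)           # Monte Carlo estimate of R(f, x)

aleatoric = sigma ** 2                      # A(x) = R(f*, x)
excess = risk_mc - aleatoric                # ER(f, x)
print(excess, (f_x - mu) ** 2)              # both should be close to 0.25
```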
model misspecification: the setting in which $f^*\notin \cH$, with $f^*$ the Bayes predictor
- there is no agreed-upon measure of misspecification; some authors focus on Bayesian or approximate Bayesian learners, which maintain a distribution over predictors, and use a discrepancy measure between the best reachable posterior predictive $p(Y\mid X)$ defined by functions $h\in \cH$ and the ground-truth likelihood $P(Y\mid X)$
- alternatively, assuming the function space $\cF(\cX, \cA)$ is endowed with a metric, misspecification (bias) can be defined as the distance between $h^*$ and $f^*$
- additionally, a common implicit assumption in approximate Bayesian methods is correct model specification (i.e., there is no bias). this assumption is rarely satisfied in practice
2.3 Bayesian uncertainty under model misspecification
Source of model misspecification
in the Bayesian framework, the hypothesis class and the corresponding set of posterior predictive distributions are only implicitly defined
The importance of bias in interactive settings
model uncertainty is just as important as, if not more than, approximation uncertainty for the problem of optimal design.
3. Direct Epistemic Uncertainty Prediction
estimating the excess risk $ER(f, x)$ requires an estimate of the expected loss $R(f, x)$ and an estimate of the aleatoric uncertainty $A(x)$
DEUP uses observed out-of-sample errors in order to train an error predictor $x \rightarrow e(f, x)$ estimating $x\rightarrow R(f, x)$
given a predictor $x\rightarrow a(x)$ of aleatoric uncertainty,
\[u(f, x) = e(f, x) - a(x)\]
becomes an estimator of $ER(f, x)$
How to estimate aleatoric uncertainty?
- if one knows that $A(x) = 0$ (e.g., labels come from a noiseless, deterministic oracle), one can simply set $a(x) = 0$, and $e(f, x)$ directly estimates EU
- In regression settings with squared loss, $A(x)$ can be estimated using the empirical variance of different outcomes of the oracle at the same point $x$. If one has multiple independent outcomes $y_1,\ldots, y_K\sim P(Y\mid x)$ for each input point $x$, then training a predictor $a$ with the squared loss on (input, target) examples $(x, \frac{K}{K-1}\mathrm{Var}(y_1,\ldots, y_K))$ yields an estimator of the aleatoric uncertainty (see the sketch after this list).
- in cases where it is not possible to estimate the aleatoric uncertainty, one can use the expected loss estimate $e(f, x)$ as a pessimistic (i.e., conservative) proxy for $u(f, x)$, i.e., set $a(x)$ to 0.
- particularly relevant for settings in which uncertainty estimates are used only to rank different data points and there is no reason to suspect significant variability of the aleatoric uncertainty across the input space
- actually the implicit assumption made whenever the variance or entropy of the posterior predictive (which, in principle, accounts for aleatoric uncertainty) is used to measure epistemic uncertainty
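Here is a minimal sketch of the variance-based recipe from the second bullet above; the heteroscedastic toy oracle and the gradient-boosting regressor are my own illustrative choices, not the paper's.

```python
# Sketch: estimating aleatoric uncertainty a(x) from K repeated observations per x,
# by regressing onto the unbiased empirical variance K/(K-1) * Var(y_1..y_K).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
K = 8

def oracle(x, k):
    # toy heteroscedastic oracle: noise std grows with |x|
    return np.sin(3 * x) + 0.1 * (1 + np.abs(x)) * rng.normal(size=k)

X = rng.uniform(-2, 2, size=(300, 1))
# unbiased variance of the K outcomes at each x (ddof=1 applies the K/(K-1) correction)
var_targets = np.array([oracle(x[0], K).var(ddof=1) for x in X])

a_predictor = GradientBoostingRegressor().fit(X, var_targets)   # a(x) approximates A(x)
print(a_predictor.predict(np.array([[0.0], [1.8]])))             # lower vs. higher noise region
```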
in the following subsections, the paper assumes that $x\mapsto a(x)$ is available, whether it is identically equal to 0 or not, and focuses on estimating the point-wise risk or expected loss $R(f, x)$.
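To make the pieces above concrete, here is a minimal sketch (my own, not the paper's exact procedure) of the recipe with a fixed dataset: fit the main predictor on one split, fit the error predictor $e$ on held-out out-of-sample squared errors, and combine with an aleatoric predictor (such as `a_predictor` from the previous sketch) via $u = e - a$:

```python
# Sketch (not the paper's exact procedure): DEUP-style estimation with a fixed training set.
# Train the main predictor h on one split, train the error predictor e on the observed
# out-of-sample squared errors of h on a held-out split, then form u = e - a.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * (1 + np.abs(X[:, 0])) * rng.normal(size=400)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

h = GradientBoostingRegressor().fit(X_tr, y_tr)                 # main predictor

val_errors = (y_val - h.predict(X_val)) ** 2                    # out-of-sample losses
e = GradientBoostingRegressor().fit(X_val, val_errors)          # error predictor e(f, x)

def deup_uncertainty(x, a_predictor=None):
    """u(f, x) = e(f, x) - a(x); with a = 0 this is a conservative upper bound on EU."""
    risk = e.predict(x)
    aleatoric = a_predictor.predict(x) if a_predictor is not None else 0.0
    # clipping at 0 is a pragmatic choice of mine, since the difference can be negative
    return np.maximum(risk - aleatoric, 0.0)

print(deup_uncertainty(np.array([[0.0], [1.9]])))
```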
3.1 Fixed Training Set
3.2 Interaction Settings
interactive settings, in which EU estimates are used to guide the acquisition of new examples, provide a more interesting use case for DEUP.
However, they bring their own challenges, as the main predictor is retrained multiple times with the newly acquired points
- the growing training set $z^N = \{(x_i, y_i)\}_{i=1}^N$ for the main predictor $h_{z^N}$ changes at each acquisition step.
- view the risk estimate $e(h_{z^N}, x)$ as a function of the pair $(x, z^N)$, rather than a function of $x$ only
- $e$ should thus be trained using a dataset $\cD_e$ of input-target pairs $((x, z^N), l(y, h_{z^N}(x)))$, where $(x, y)$ is not part of $z^N$
- a learning algorithm with good generalization properties would in principle be able to extrapolate from $\cD_e$, and estimate the errors made by a predictor $h$ on points $x\in\cX$ not seen so far, i.e., belonging to what they call the frontier of knowledge
- the learner usually has no access to a held-out validation set, given that the goal of such an interactive learner is to learn the Bayes predictor $f^*$ using as little data as possible
- $N_0 \ge 0$: the number of initially available training points before any acquisition
- observing that for $i > N_0$, the pair $(x_i, y_i)$ is not used to train the predictors $h_{z^{N_0}}, h_{z^{N_0+1}}, \ldots, h_{z^{i-1}}$, the paper proposes to use the future acquired points as out-of-sample examples for the past predictors, in order to build the training dataset $\cD_e$ for the error estimator
- at step $M > N_0$, after acquiring $(M - N_0)$ additional input-output pairs $(x, y)$ and obtaining the predictor $h_{z^M}$, $\cD_e$ is equal to
\[\cD_e = \bigcup_{i=N_0+1}^{M}\; \bigcup_{N=N_0}^{i-1} \left\{\left((x_i, z^{N}), l(y_i, h_{z^{N}}(x_i))\right)\right\}\]
- using $\cD_e$ requires storing in memory all versions of the main predictor $h$
- it requires using predictors that take as an input a dataset of arbitrary size, which might lead to overfitting issues as the dataset size grows
propose the following two approximations of $\cD_e$
- embed each input pair $(x_i, z^N)$ in a feature space $\Phi$, and replace each such pair in $\cD_e$ with the feature vector $\phi_{z^N}(x_i)$, hereafter referred to as the stationarizing features of the dataset $z^N$ at $x_i$ (why? the motivation is not entirely clear to me)
- to alleviate the need for storing multiple versions of $h$, make each pair $(x_i, y_i)$ contribute to $\cD_e$ once rather than $i - N_0$ times, by replacing the inner union with the singleton $\{((x_i, z^{i-1}), l(y_i, h_{z^{i-1}}(x_i)))\}$. in other words, for each predictor $h_{z^N}$, only the next acquired point $(x_{N+1}, y_{N+1})$ is used to populate $\cD_e$
these approximations result in the following training dataset of the error estimator at step $M$
\[\cD_e = \left\{\left(\phi_{z^{i-1}}(x_i), l(y_i, h_{z^{i-1}}(x_i))\right)\right\}_{i\in \{N_0+1, \ldots, M\}}\]
for the stationarizing features, the paper explored
\[\phi_{z^N}(x) = (x, s, \hat q(x\mid z^N), \hat V(\tilde\cL, z^N, x))\]
where
- $\hat q(x\mid z^N)$ is a density estimate from data $z^N$ at $x$
- $s = 1$ if $x$ is part of $z^N$ and otherwise 0
- $\hat V$ is an estimate of the model variance
for numerical reasons, it is preferable to use $\log\hat q$ and $\log \hat V$
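The following sketch illustrates one way to compute these stationarizing features for a batch of query points; the Gaussian KDE for $\hat q$ and an ensemble-variance stand-in for $\hat V$ are assumptions of mine, not prescribed choices.

```python
# Sketch: stationarizing features phi_{z^N}(x) = (x, s, log q_hat, log V_hat).
# The KDE density estimate and the ensemble-variance stand-in for V_hat are
# illustrative choices, not prescribed by the paper.
import numpy as np
from scipy.stats import gaussian_kde

def stationarizing_features(x_query, X_train, ensemble, eps=1e-12):
    """x_query: (m, 1) query points; X_train: (n, 1) current dataset z^N (1-d inputs)."""
    kde = gaussian_kde(X_train[:, 0])                  # q_hat(. | z^N)
    log_q = np.log(kde(x_query[:, 0]) + eps)

    # s = 1 if x is (numerically) part of z^N, else 0
    s = np.array([float(np.any(np.isclose(X_train[:, 0], xq))) for xq in x_query[:, 0]])

    # V_hat: disagreement of an ensemble of predictors trained on z^N
    preds = np.stack([m.predict(x_query) for m in ensemble])
    log_v = np.log(preds.var(axis=0) + eps)

    return np.column_stack([x_query[:, 0], s, log_q, log_v])
```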
Pre-training the error predictor
- if the learner cannot afford to wait for a few rounds of acquisition in order to build a dataset $\cD_e$ large enough to train the error predictor, it is possible to pre-fill $\cD_e$ using the $N_0$ initially available training points $z^{N_0}$ only, following a strategy inspired by K-fold cross-validation.
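A minimal sketch of one way to implement this pre-filling, reusing the hypothetical `fit_predictor` and `features` helpers assumed in the loop sketch further below; the fold count is arbitrary and the squared loss assumes regression.

```python
# Sketch (my reading of the K-fold-inspired idea, not the paper's exact procedure):
# pre-fill D_e using only the N_0 initial points, by holding out each fold in turn,
# training the main predictor on the rest, and recording its out-of-sample losses.
import numpy as np
from sklearn.model_selection import KFold

def prefill_D_e(X0, y0, fit_predictor, features, n_splits=4):
    D_e = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X0):
        h_fold = fit_predictor(X0[train_idx], y0[train_idx])
        for i in val_idx:
            loss = (y0[i] - h_fold.predict(X0[i:i + 1])[0]) ** 2   # squared loss, regression case
            D_e.append((features(X0[i], X0[train_idx]), loss))
    return D_e
```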
putting all of the above together yields the pseudo-code for DEUP in interactive settings (sketched below)
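Since the notes do not reproduce the paper's pseudo-code, here is a rough Python sketch of the interactive loop as I understand it from the points above; `fit_predictor`, `fit_error_predictor`, `features` (the stationarizing map $\phi$), `acquire`, and `oracle` are hypothetical stand-ins for application-specific components.

```python
# Hedged sketch of DEUP in an interactive setting (my reading of the notes above,
# not the paper's verbatim pseudo-code). All helper callables are assumed to be
# provided by the surrounding application.

def deup_interactive(X0, y0, oracle, fit_predictor, fit_error_predictor,
                     features, acquire, a=lambda x: 0.0, n_rounds=50):
    X, y = list(X0), list(y0)          # z^{N_0}: initially available data
    D_e = []                           # training set for the error predictor
    h = fit_predictor(X, y)            # main predictor h_{z^{N_0}}
    e = None

    for _ in range(n_rounds):
        # EU estimate u(x) = e(phi_{z^N}(x)) - a(x); fall back to 0 before e exists
        def u(x):
            return 0.0 if e is None else e.predict([features(x, X)])[0] - a(x)

        x_new = acquire(u)             # e.g., pick a candidate with high u (or plug u into an acquisition fn)
        y_new = oracle(x_new)

        # the newly acquired point is out-of-sample for the current predictor h,
        # so its loss is a valid training example for the error predictor (squared loss, regression case)
        loss = (y_new - h.predict([x_new])[0]) ** 2
        D_e.append((features(x_new, X), loss))

        X.append(x_new); y.append(y_new)
        h = fit_predictor(X, y)        # retrain main predictor on the grown dataset
        e = fit_error_predictor(D_e)   # retrain error predictor on D_e

    return h, e
```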
4. Related Work
Bayesian Learning
- Gaussian Processes are a popular way to estimate EU, as the variance among the functions in the posterior (given the training data) can be computed analytically
- Beyond GPs, efficient MCMC-based approaches have been used for approximating samples from the Bayesian posterior on large datasets
- other work uses the posterior distribution of weights in Bayesian Neural Networks (BNNs) to capture EU
- SWAG, building upon SWA, fits a Gaussian distribution on the first moments of SGD iterates; this distribution is then used as a posterior over the neural network weights
- Dusenberry et al. (2020) parameterize the BNN with a distribution on a rank-1 subspace for each weight matrix
- other techniques that rely on measuring the discrepancy between different predictors as a proxy for EU include MC-Dropout, which interprets Dropout as a variational inference technique in BNNs
- these approaches, relying on sampling of weights or dropout masks at inference time, share some similarities with ensemble-based methods (including bagging and boosting), where multiple predictors are trained and their outputs are averaged to make a prediction, although the latter measure variability due to the training set rather than the spread of functions compatible with a given training set, as in Bayesian approaches
- Deep Ensembles: closer to the Bayesian approach, using an ensemble of neural networks that differ because of randomness in initialization and training
- Wen et al. (2020): a memory-efficient way of implementing deep ensembles, by using one shared matrix and a rank-1 matrix per member for the parameters
- Vadera et al. (2020b); Malinin et al. (2020) improve the efficiency of ensembles by distilling the distribution of predictions rather than the average, thus preserving the information about the uncertainty captured by the ensemble
- Classical work on Query by committee studied the idea of using discrepancy as a measure for information gain for the design of experiments
- using orthogonal certificates which capture the distance between a test sample and the dataset
- distance awareness for uncertainty estimation, which, along with the subsequent methods DUE and DDU, combines feature representations learnt by deep neural networks with exact Bayesian inference methods like GPs and Gaussian Discriminant Analysis
- Deep Kernel Learning
- alternative instantiation of DKL using RBF networks
- DUN uses the disagreement between the outputs from intermediate layers as a measure of uncertainty
- evidential uncertainty estimation:
- estimates EU based on a parametric estimate of the model variance, which has been shown to yield poor uncertainty estimates
- combine several of these techniques in the context of large neural networks
- use estimators of statistical features to capture the uncertainty for OOD detection
Distribution-free uncertainty estimation
- DEUP can broadly be categorized as a distribution-free uncertainty estimation method; it differs from Conformal Prediction in that it does not require a pre-defined degree of confidence before outputting a prediction set
Loss Prediction
- more closely related to DEUP, Yoo & Kweon (2019) propose a loss prediction module that learns to predict the value of the loss function
- Hu et al. (2020) also propose using a separate network that learns to predict the variance of an ensemble
- but these methods are trained only to capture the in-sample error, and do not capture the out-of-sample error, which is more relevant for scenarios like active learning where one wants to pick $x$ where the reducible generalization error is large.
so they are similar to DEUP under the fixed-training-set scenario
- EpiOut proposes learning a binary output that simply distinguishes between low and high EU
5. Experiments
through experiments, to support two claims
- C1: epistemic uncertainty measured by DEUP leads to significant improvements in downstream decision-making tasks compared to established baselines
- C2: the error predictor learned by DEUP can generalize to unseen samples
5.1 Sequential Model Optimization
- acquisition functions, such as Expected Improvement (EI), trade off exploration and exploitation
5.2 Reinforcement Learning
5.3 Uncertainty Estimation
5.3.1 Epistemic Uncertainty Estimation for Drug Combinations
- DrugComb: a dataset consisting of pairwise combinations of anti-cancer compounds tested on various cancer cell lines
- for each combination, the dataset provides access to several synergy scores, each indicating whether the two drugs have a synergistic or antagonistic effect on cancerous cell death
- LINCS L1000: contains differential gene expression profiles for various cell lines and drugs
5.3.2 Epistemic Uncertainty Predictions for Rejecting Difficult Examples
consider a standard OOD Detection task
consider the rank correlation between the predicted uncertainty and the observed generalization error
6. Conclusion and Future Work
whereas standard measures of epistemic uncertainty focus on variance (due to approximation error), the paper argues that bias (introduced by misspecification) should be accounted for as part of the epistemic uncertainty, as it is reducible for predictors like neural networks whose effective capacity is a function of the training data