
Direct Epistemic Uncertainty Prediction

Tags: Uncertainty Quantification, Epistemic Uncertainty, Distribution-free

Lahlou, S., Jain, M., Nekoei, H., Butoi, V. I., Bertin, P., Rector-Brooks, J., Korablyov, M., & Bengio, Y. (2023). DEUP: Direct Epistemic Uncertainty Prediction (No. arXiv:2102.08501). arXiv. https://doi.org/10.48550/arXiv.2102.08501

Epistemic uncertainty: a measure of the lack of knowledge of a learner which diminishes with more evidence

  • while existing work focuses on using the variance of the Bayesian posterior due to parameter uncertainty as a measure of epistemic uncertainty, the paper argues that this does not capture the part of lack of knowledge induced by model misspecification

the paper

  • discusses how the excess risk, i.e., the gap between the generalization error of a predictor and that of the Bayes predictor, is a sound measure of epistemic uncertainty which captures the effect of model misspecification
  • propose a principled framework for directly estimating the excess risk by learning a secondary predictor for the generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability
  • discuss the merits of the novel measure of epistemic uncertainty, and highlight how it differs from variance-based measures of epistemic uncertainty
  • the framework Direct Epistemic Uncertainty Prediction (DEUP) is particularly interesting in interactive learning environments, where the learner is allowed to acquire novel examples in each round
  • through a wide set of experiments, illustrate how existing methods in sequential model optimization can be improved with epistemic uncertainty estimates from DEUP, and how DEUP can be used to drive exploration in reinforcement learning. They also evaluate the quality of uncertainty estimates from DEUP for probabilistic image classification and for predicting synergies of drug combinations

Introduction

  • epistemic uncertainty (EU): a measure of lack of knowledge that an active learner should minimize
  • aleatoric uncertainty: an intrinsic notion of randomness, irreducible by nature
  • EU can be potentially reduced with additional information

EU estimation is a key ingredient in interactive decision making settings such as

  • active learning
  • sequential model optimization (SMO)
  • reinforcement learning (RL)

What is uncertainty and how to quantify it?

  • MC-Dropout and Deep Ensembles, both of which are approximate Bayesian methods, use the posterior predictive variance as a measure of uncertainty: if multiple neural nets that are all compatible with the data make different predictions at $x$, the discrepancy between these predictions is a strong indicator of epistemic uncertainty at $x$

Pitfalls of using the Bayesian posterior to estimate uncertainty

epistemic uncertainty in a predictive model, seen as a measure of lack of knowledge, consists of

  • approximation uncertainty: due to the finite size of the training dataset
  • model uncertainty: due to model misspecification

approximate Bayesian methods often suffer from model misspecification, e.g., due to the implicit bias induced by SGD and the finite computational time


contributions:

  1. analyze the pitfalls of using discrepancy-based measures of EU, given that they miss out on the bias due to model misspecification, which is a component of the excess risk, i.e., the gap between the risk (or out-of-sample loss) of the predictor at point $x$ and that of the Bayes predictor at $x$ (the one with the lowest expected loss, which no amount of additional data could reduce)
  2. consider the fundamental notion of epistemic uncertainty as lack of knowledge, and based on this, propose to estimate the excess risk as a measure of EU. They propose DEUP (Direct Epistemic Uncertainty Prediction), in which one trains a secondary model, called the error predictor, with an approximate objective and approximate data, to estimate the point-wise generalization error (the risk), and then subtracts an estimate of aleatoric uncertainty if available, or otherwise obtains an upper bound on EU.

accounting for bias when measuring EU is particularly useful to an interactive learner whose effective capacity depends on the training data, especially in regions of the input space one cares about

  • DEUP is agnostic to the type of predictors used, and in interactive settings, it is agnostic to the particular search method still needed to select points with high EU in order to propose new candidates for active learning, SMO, or exploration in RL

a unique advantage of DEUP, compared with discrepancy-based measures of EU, is that it can be explicitly trained to care about, and be calibrated for, examples which may come from a distribution slightly different from the distribution of most of the training examples.

  • such distribution shifts (referred to as feedback covariate shift) arise naturally in interactive contexts such as RL, because the learner explores new areas of the input space
  • in these non-stationary settings, one typically wants to retrain the main predictor as one acquires new training data, not just because more training data is generally better, but also to better track the changing distribution and generalize to yet unseen but upcoming OOD inputs
  • a large error initially made at a point $x$ before $x$ is incorporated in the training set will typically be greatly reduced after updating the main predictor with $(x, y)$.
  • to handle this non-stationarity, DEUP uses additional features as input to the error predictor that are informative of both the input point and the dataset used to obtain the current predictor

2. Excess Risk, Epistemic Uncertainty, and Model Misspecification

2.1 Notations and Background

  • outputs $y\in \cY$
  • inputs $x\in \cX\subset \IR^d$
  • function $f: \cX\rightarrow \cA$
  • training dataset: $z^N = (z_1,\ldots, z_N)\in \cZ^N$, where $\cZ = \cX\times \cY$ and $z_i = (x_i, y_i)$
  • unknown ground-truth generative model: $P(X, Y) = P(X)P(Y\mid X)$
  • set $\cA$: the action space
    • equal to $\cY$ for regression
    • equal to $\Delta(\cY)$, the set of probability distributions on $\cY$ for classification problems
  • loss function: $l: \cY\times \cA\rightarrow\IR^+$
    • $l(y, a) = \Vert y-a\Vert^2$ for regression
    • $l(y, a) = -\log a(y)$ for classification

the point-wise risk (or expected loss) of a predictor $f$ at $x\in\cX$ is defined as

\[R(f, x) = \bbE_{P(Y\mid X=x)}[l(Y, f(x))]\]

define the risk (total expected loss) of a predictor as the marginalization over $x$:

\[R(f) = \bbE_{P(X, Y)}[l(Y, f(X))]\]

given a hypothesis space $\cH$, a subset of $\cF(\cX, \cA)$, the set of functions from $\cX$ to $\cA$

the goal of any learning algorithm (or learner) $\cL$ is to find a predictive function $h^*\in \cH$ with the lowest possible risk

\[h^* = \argmin_{h\in\cH} R(h)\]

in practice, the learner returns a predictor $\cL(z^N) = h_{z^N}\in \cH$ minimizing an approximation of the risk, called the empirical risk

\[R_{z^N}(h) = \frac{1}{N}\sum_{i=1}^N l(y_i, h(x_i))\,, \qquad \text{where } z_i = (x_i, y_i)\]

\[h_{z^N} = \argmin_{h\in \cH} R_{z^N}(h)\]

the dataset $z^N$ need not be used solely to define the empirical risk, e.g., the learner can use a subset as a validation set to tune its hyperparameters
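As a concrete illustration of these definitions, here is a minimal numpy sketch (my own, not from the paper): it contrasts the point-wise risk $R(f, x)$, estimated by Monte Carlo, with the empirical risk $R_{z^N}(h)$ on a finite training set; the toy data-generating process and the linear predictor are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy ground truth: Y | X = x ~ N(sin(x), 0.1^2), squared loss
mu, sigma = np.sin, 0.1

def pointwise_risk(f, x, n_mc=100_000):
    """Monte Carlo estimate of R(f, x) = E[(Y - f(x))^2 | X = x]."""
    y = mu(x) + sigma * rng.normal(size=n_mc)
    return np.mean((y - f(x)) ** 2)

def empirical_risk(h, xs, ys):
    """R_{z^N}(h) = (1/N) sum_i l(y_i, h(x_i)) with squared loss."""
    return np.mean((ys - h(xs)) ** 2)

# a (misspecified) linear predictor fit on a small training set z^N
xs = rng.uniform(-2, 2, size=20)
ys = mu(xs) + sigma * rng.normal(size=20)
coef = np.polyfit(xs, ys, deg=1)
h = lambda x: np.polyval(coef, x)

print("point-wise risk of h at x = 1.5:", pointwise_risk(h, 1.5))
print("empirical risk of h on z^N:     ", empirical_risk(h, xs, ys))
```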

2.2 Sources of lack of knowledge

  • Bayes predictor: $f^*(x) = \argmin_{a\in \cA}\bbE_{P(Y\mid X=x)}[l(Y, a)]$.
    • $R(f^*, x) > 0$ indicates an irreducible risk due to the inherent randomness of $P(Y\mid X=x)$.
    • $R(f^*, x)$ is thus a measure of aleatoric uncertainty at $x$
    • there may be more than one Bayes predictor, but they all have the same point-wise risk, denoted $A(x) = R(f^*, x)$
  • minimizing the risk over $\cH$ rather than $\cF(\cX, \cA)$ induces a discrepancy between $h^*$ (the optimal predictor in $\cH$) and $f^*$ (the Bayes predictor), usually referred to as model uncertainty. This can be seen as a form of bias, as the optimization is limited to functions in $\cH$
  • the discrepancy between $h_{z^N}$ and $h^*$ is called the approximation uncertainty, due both to the finite training set and the finite computational resources for training


excess risk: \(ER(f, x) = R(f, x) - A(x)\)

an estimator of the excess risk of $f$ at $x$ can be used as a measure of epistemic uncertainty

using an estimator of ER as a measure of EU is the main idea behind DEUP

Example (regression with squared loss): if \(P(Y\mid X=x) = N(Y; \mu(x), \sigma^2(x))\), the Bayes predictor is $f^* = \mu$, and

\[R(f, x) = \bbE_{P(Y\mid X=x)}[(Y-f(x))^2] = \Var(Y) + (\bbE Y)^2 - 2f(x)\bbE Y + f^2(x) = \sigma^2(x) + (f(x) - \mu(x))^2\,,\]

so \(ER(f, x) = (f(x) - \mu(x))^2\).

For classification with the log loss, writing $\mu(x) = P(Y\mid X=x)$, the risk is the cross-entropy and the excess risk is the KL divergence:

\[R(f, x) = \bbH(\mu(x), f(x))\,, \qquad ER(f, x) = D_{KL}(\mu(x) \,\Vert\, f(x))\]
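A quick numerical check of the squared-loss identity above (my own sketch, with made-up values of $\mu(x)$, $\sigma(x)$ and $f(x)$): the Monte Carlo estimate of $R(f, x) - A(x)$ should match $(f(x) - \mu(x))^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

x = 0.7
mu_x, sigma_x = np.sin(x), 0.3     # P(Y | X = x) = N(mu(x), sigma(x)^2)
f_x = 1.2                          # some (biased) prediction at x

y = mu_x + sigma_x * rng.normal(size=1_000_000)
risk = np.mean((y - f_x) ** 2)     # Monte Carlo estimate of R(f, x)

print("R(f,x) - A(x)    :", risk - sigma_x ** 2)   # ~ excess risk
print("(f(x) - mu(x))^2 :", (f_x - mu_x) ** 2)     # analytical excess risk
```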

model misspecification occurs if $f^*\notin \cH$, where $f^*$ is the Bayes predictor

  • there is no agreed-upon measure of misspecification; some authors focus on Bayesian or approximate Bayesian learners, which maintain a distribution over predictors, and use a discrepancy measure between the best reachable posterior predictive $p(Y\mid X)$ defined by functions $h\in \cH$, and the ground-truth likelihood $P(Y\mid X)$

  • alternatively, assuming the function space $\cF(\cX, \cA)$ is endowed with a metric, misspecification (bias) can be defined as the distance between $h^*$ and $f^*$

  • additionally, a common implicit assumption in approximate Bayesian methods is correct model specification (i.e., there is no bias). This assumption is rarely satisfied in practice

2.3 Bayesian uncertainty under model misspecification

Source of model misspecification

in the Bayesian framework, the hypothesis class and the corresponding set of posterior predictive distributions are only implicitly defined

The importance of bias in interactive settings

model uncertainty is just as important as, if not more than, approximation uncertainty for the problem of optimal design.

3. Direct Epistemic Uncertainty Prediction

estimating the excess risk $ER(f, x)$ requires an estimate of the expected loss $R(f, x)$ and an estimate of the aleatoric uncertainty $A(x)$

DEUP uses observed out-of-sample errors in order to train an error predictor $x \rightarrow e(f, x)$ estimating $x\rightarrow R(f, x)$

given a predictor $x\rightarrow a(x)$ of aleatoric uncertainty,

\[u(f, x) = e(f, x) - a(x)\]

becomes an estimator of $ER(f, x)$
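In code, the DEUP estimate is simply the difference of two learned functions. Below is a minimal sketch of this idea (my own, not the paper's implementation); the gradient-boosting error predictor and the optional aleatoric predictor are placeholder choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

class DEUP:
    """u(f, x) = e(f, x) - a(x): estimated excess risk (epistemic uncertainty) at x."""

    def __init__(self, aleatoric_predictor=None):
        self.error_predictor = GradientBoostingRegressor()   # e(f, .)
        self.aleatoric_predictor = aleatoric_predictor       # a(.), optional

    def fit_error_predictor(self, X_out, observed_losses):
        # X_out, observed_losses: out-of-sample points and the losses
        # the main predictor f incurred on them
        self.error_predictor.fit(X_out, observed_losses)

    def epistemic_uncertainty(self, X):
        e = self.error_predictor.predict(X)
        a = 0.0 if self.aleatoric_predictor is None else self.aleatoric_predictor.predict(X)
        return np.maximum(e - a, 0.0)   # clip: the excess risk is non-negative
```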

How to estimate aleatoric uncertainty?

  1. if one knows that $A(x) = 0$ (e.g., a deterministic, noiseless oracle), then $a(x) = 0$ and $u(f, x) = e(f, x)$ directly
  2. In regression settings with the squared loss, $A(x)$ can be estimated using the empirical variance of different outcomes of the oracle at the same point $x$. If one has multiple independent outcomes $y_1,\ldots, y_K\sim P(Y\mid x)$ for each input point $x$, then training a predictor $a$ with the squared loss on (input, target) examples $(x, \frac{K}{K-1}\mathrm{Var}(y_1,\ldots, y_K))$ yields an estimator of the aleatoric uncertainty (see the sketch after this list).
  3. in cases where it is not possible to estimate the aleatoric uncertainty, one can use the expected loss estimate $e(f, x)$ as a pessimistic (i.e., conservative) proxy for $u(f, x)$, i.e., set $a(x)$ to 0.
    • particularly relevant in settings where uncertainty estimates are used only to rank different data points, and where there is no reason to suspect significant variability in aleatoric uncertainty across the input space
    • actually the implicit assumption made whenever the variance or entropy of the posterior predictive (which, in principle, accounts for aleatoric uncertainty) is used to measure epistemic uncertainty
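A sketch of option 2 above (mine, not the paper's code): with $K$ repeated noisy measurements per input, regress the unbiased sample variance on $x$ to obtain $a(x)\approx A(x)$. The toy oracle and the random-forest regressor are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# toy oracle with input-dependent noise: Y | x ~ N(sin(x), (0.1 + 0.2 x^2)^2)
def oracle(x, K):
    return np.sin(x) + (0.1 + 0.2 * x**2) * rng.normal(size=K)

K = 10
xs = rng.uniform(-2, 2, size=200)
# np.var(..., ddof=1) is the unbiased variance, i.e. K/(K-1) times the plug-in variance
targets = np.array([np.var(oracle(x, K), ddof=1) for x in xs])

a = RandomForestRegressor(random_state=0).fit(xs.reshape(-1, 1), targets)  # a(x) ~ A(x)

x_test = np.array([[0.0], [1.5]])
print("estimated aleatoric variance:", a.predict(x_test))
print("true sigma^2(x):             ", (0.1 + 0.2 * x_test[:, 0] ** 2) ** 2)
```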

in the following subsections, the paper assumes that $x\mapsto a(x)$ is available, whether it is identically equal to 0 or not, and focuses on estimating the point-wise risk or expected loss $R(f, x)$.

3.1 Fixed Training Set


3.2 Interaction Settings

Interactive settings, where EU estimates are used to guide the acquisition of new examples, provide a more interesting use case for DEUP.

However, they bring their own challenges, as the main predictor is retrained multiple times with the newly acquired points

  1. the growing training set $z^N = \{(x_i, y_i)\}_{i=1}^N$ for the main predictor $h_{z^N}$ changes at each acquisition step.
    • view the risk estimate $e(h_{z^N}, x)$ as a function of the pair $(x, z^N)$, rather than a function of $x$ only
    • $e$ should thus be trained using a dataset $\cD_e$ of input-target pairs $((x, z^N), l(y, h_{z^N}(x)))$, where $(x, y)$ is not part of $z^N$
    • a learning algorithm with good generalization properties would in principle be able to extrapolate from $\cD_e$, and estimate the errors made by a predictor $h$ on points $x\in\cX$ not seen so far, i.e., belonging to what they call the frontier of knowledge
  2. the learner usually has no access to a held-out validation set, given that the goal of such an interactive learner is to learn the Bayes predictor $f^*$ using as little data as possible
  • $N_0 \ge 0$: the number of initially available training points before any acquisition
  • observing that for $i > N_0$, the pair $(x_i, y_i)$ is not used to train the predictors $h_{z^{N_0}}, h_{z^{N_0+1}}, \ldots, h_{z^{i-1}}$, the authors propose to use the future acquired points as out-of-sample examples for the past predictors, in order to build the training dataset $\cD_e$ for the error estimator
  • at step $M > N_0$, after acquiring $(M - N_0)$ additional input-output pairs $(x, y)$ and obtaining the predictor $h_{z^M}$, $\cD_e$ is equal to
\[\cD_e = \bigcup_{i=N_0+1}^M\ \bigcup_{N=N_0}^{i-1}\left\{\left((x_i, z^N),\ l(y_i, h_{z^N}(x_i))\right)\right\}\]
  • using $\cD_e$ requires storing in memory all versions of the main predictor $h$
  • it requires using predictors that take as an input a dataset of arbitrary size, which might lead to overfitting issues as the dataset size grows

propose the following two approximations of $\cD_e$

  1. embed each input pair $(x_i, z^N)$ in a feature space $\Phi$, and replace each such pair in $\cD_e$ with the feature vector $\phi_{z^N}(x_i)$, hereafter referred to as the stationarizing features of the dataset $z^N$ at $x_i$ (why? not clear about the motivations)
  2. to alleviate the need to store multiple versions of $h$, make each pair $(x_i, y_i)$ contribute to $\cD_e$ once rather than $i - N_0$ times, by replacing the inner union with the singleton $\{((x_i, z^{i-1}), l(y_i, h_{z^{i-1}}(x_i)))\}$. In other words, for each predictor $h_{z^N}$, only the next acquired point $(x_{N+1}, y_{N+1})$ is used to populate $\cD_e$

these approximations result in the following training dataset of the error estimator at step $M$

\[\cD_e = \left\{\left(\phi_{z^{i-1}}(x_i),\ l(y_i, h_{z^{i-1}}(x_i))\right)\right\}_{i\in \{N_0+1, \ldots, M\}}\]

the paper explored

\[\phi_{z^N}(x) = (x, s, \hat q(x\mid z^N), \hat V(\tilde\cL, z^N, x))\]

where

  • $\hat q(x\mid z^N)$ is a density estimate from data $z^N$ at $x$
  • $s = 1$ if $x$ is part of $z^N$ and otherwise 0
  • $\hat V(\tilde\cL, z^N, x)$ is an estimate of the variance of a secondary learner $\tilde\cL$ trained on $z^N$, evaluated at $x$

for numerical reasons, it is preferable to use $\log\hat q$ and $\log \hat V$
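One possible rendering of these features in code (my own sketch, not the paper's implementation; here $\hat q$ is a kernel density estimate and $\hat V$ the predictive variance of a GP surrogate standing in for $\tilde\cL$, and the bandwidth is an arbitrary choice):

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.gaussian_process import GaussianProcessRegressor

def stationarizing_features(x, X_train, y_train):
    """phi_{z^N}(x) = (x, s, log q_hat(x | z^N), log V_hat(L~, z^N, x))."""
    x = np.atleast_2d(x)

    # s: 1 if x is (numerically) among the training inputs, else 0
    s = float(np.any(np.all(np.isclose(X_train, x), axis=1)))

    # log-density estimate of x under the training inputs
    kde = KernelDensity(bandwidth=0.5).fit(X_train)
    log_q = kde.score_samples(x)[0]

    # log predictive variance of a secondary GP learner trained on z^N
    gp = GaussianProcessRegressor().fit(X_train, y_train)
    _, std = gp.predict(x, return_std=True)
    log_v = np.log(std[0] ** 2 + 1e-12)

    return np.concatenate([x.ravel(), [s, log_q, log_v]])
```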

Pre-training the error predictor

  • if the learner cannot afford to wait for a few rounds of acquisition in order to build a dataset $\cD_e$ large enough to train the error predictor, it is possible to pre-fill $\cD_e$ using only the $N_0$ initially available training points $z^{N_0}$, following a strategy inspired by K-fold cross-validation (sketched below).
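A minimal sketch of this pre-filling idea (mine, not the paper's code; the number of folds, the GP main learner and the squared loss are placeholder choices): train the main predictor on each fold's complement and record its losses on the held-out fold.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.gaussian_process import GaussianProcessRegressor

def prefill_error_dataset(X0, y0, featurize, n_splits=4):
    """Pre-fill D_e from the N_0 initial points via a K-fold-style scheme."""
    D_e_inputs, D_e_targets = [], []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in folds.split(X0):
        h = GaussianProcessRegressor().fit(X0[train_idx], y0[train_idx])
        preds = h.predict(X0[val_idx])
        for x, y, p in zip(X0[val_idx], y0[val_idx], preds):
            # featurize is e.g. the stationarizing_features sketch above
            D_e_inputs.append(featurize(x, X0[train_idx], y0[train_idx]))
            D_e_targets.append((y - p) ** 2)   # out-of-fold squared loss
    return np.array(D_e_inputs), np.array(D_e_targets)
```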


Putting all the above together yields the pseudo-code for DEUP in interactive settings (algorithm figure omitted here; see the sketch below).

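Here is one possible rendering of that interactive loop as a Python sketch (my own reconstruction, not the paper's code): the GP main predictor, the gradient-boosting error predictor, the pure-exploration acquisition rule, and the assumption $a(x) = 0$ are all placeholder choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import GradientBoostingRegressor

def deup_interactive(X0, y0, oracle, candidates, featurize, n_rounds=20):
    """Sketch of DEUP in an interactive setting (squared loss, a(x) = 0)."""
    X, y = X0.copy(), y0.copy()
    D_e_X, D_e_y = [], []                            # training data for e(f, .)
    for _ in range(n_rounds):
        h = GaussianProcessRegressor().fit(X, y)     # retrain the main predictor
        # epistemic uncertainty on candidates via the error predictor
        if len(D_e_y) > 2:
            e = GradientBoostingRegressor().fit(np.array(D_e_X), np.array(D_e_y))
            u = e.predict(np.array([featurize(c, X, y) for c in candidates]))
        else:
            u = np.ones(len(candidates))             # no estimate yet
        x_new = candidates[int(np.argmax(u))]        # acquire where EU is highest
        y_new = oracle(x_new)
        # before retraining, (x_new, y_new) is out-of-sample for the current h
        D_e_X.append(featurize(x_new, X, y))
        D_e_y.append((y_new - h.predict(np.atleast_2d(x_new))[0]) ** 2)
        X = np.vstack([X, np.atleast_2d(x_new)])
        y = np.append(y, y_new)
    return X, y
```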

4. Related Work

Bayesian Learning

  • Gaussian Processes are a popular way to estimate EU, as the variance among the functions in the posterior (given the training data) can be computed analytically
  • Beyond GPs, efficient MCMC-based approaches have been used for approximating samples from the Bayesian posterior on large datasets
    • some works use the posterior distribution of weights in Bayesian Neural Networks (BNNs) to capture EU
    • SWAG, building upon SWA, fits a Gaussian distribution to the first moments of SGD iterates; this distribution is then used as a posterior over the neural network weights
    • Dusenberry et al. (2020) parameterize the BNN with a distribution on a rank-1 subspace for each weight matrix
  • other techniques that rely on measuring the discrepancy between different predictors as a proxy for EU include MC Dropout, which interprets Dropout as a variational inference technique in BNNs.
    • These approaches, relying on sampling of weights or dropout masks at inference time, share some similarities with ensemble-based methods (including bagging and boosting), where multiple predictors are trained and their outputs averaged to make a prediction, although the latter measure variability due to the training set rather than the spread of functions compatible with a given training set, as in Bayesian approaches
  • Deep Ensembles: closer to the Bayesian approach, using an ensemble of neural networks that differ because of randomness in initialization and training
    • Wen et al. (2020): a memory-efficient way of implementing deep ensembles, using one shared matrix and a rank-1 matrix per ensemble member
    • Vadera et al. (2020b); Malinin et al. (2020) improve the efficiency of ensembles by distilling the distribution of predictions rather than the average, thus preserving the information about the uncertainty captured by the ensemble
  • Classical work on Query by committee studied the idea of using discrepancy as a measure for information gain for the design of experiments
    • using orthogonal certificates which capture the distance between a test sample and the dataset
    • distance awareness for uncertainty estimation, which, along with the subsequent methods DUE and DDU, combines feature representations learnt by deep neural networks with exact Bayesian inference methods like GPs and Gaussian Discriminant Analysis
  • Deep Kernel Learning
    • alternative instantiation of DKL using RBF networks
    • DUN uses the disagreement between the outputs from intermediate layers as a measure of uncertainty
  • evidential uncertainty estimation:
    • estimates EU based on a parametric estimate of the model variance, which has been shown to have poor uncertainty estimates
    • combine several of these techniques in the context of large neural networks
    • use estimators of statistical features to capture the uncertainty for OOD detection

Distribution-free uncertainty estimation

  • DEUP can broadly be categorized as a distribution-free uncertainty estimation method; it differs from Conformal Prediction as it does not require a pre-defined degree of confidence before outputting a prediction set

Loss Prediction

  • more closely related to DEUP, Yoo & Kweon (2019) propose a loss prediction module that learns to predict the value of the loss function
  • Hu et al. (2020) also propose using a separate network that learns to predict the variance of an ensemble
  • But these methods are trained only to capture the in-sample error, and do not capture the out-of-sample error, which is more relevant for scenarios like active learning where one wants to pick $x$ where the reducible generalization error is large.

so they are similar to DEUP in the fixed training set scenario

  • EpiOut proposes learning a binary output that simply distinguishes between low and high EU

5. Experiments

through the experiments, the paper supports two claims

  • C1: Epistemic uncertainty measured by DEUP leads to significant improvements in downstream decision-making tasks compared to established baselines
  • C2: the error predictor learned by DEUP can generalize to unseen samples

5.1 Sequential Model Optimization

  • acquisition functions, such as Expected Improvement (EI), trade off exploration and exploitation; a generic EI sketch is given below
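For reference, a standard EI acquisition computed from a predictive mean and an uncertainty estimate; with DEUP, $\sqrt{u(f, x)}$ can stand in for the predictive standard deviation. This is my own generic sketch, not the paper's exact setup.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    """EI for maximization, given predictive mean/std arrays over candidate points."""
    sigma = np.maximum(sigma, 1e-9)          # avoid division by zero
    z = (mu - best_so_far - xi) / sigma
    return (mu - best_so_far - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# e.g. sigma = np.sqrt(u) where u is the DEUP excess-risk estimate per candidate
```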



5.2 Reinforcement Learning


5.3 Uncertainty Estimation

5.3.1 Epistemic Uncertainty Estimation for Drug Combinations

  • DrugComb: a dataset consisting of pairwise combinations of anti-cancer compounds tested on various cancer cell lines
    • for each combination, the dataset provides access to several synergy scores, each indicating whether the two drugs have a synergistic or antagonistic effect on cancerous cell death
  • LINCS L1000: contains differential gene expression profiles for various cell lines and drugs


5.3.2 Epistemic Uncertainty Predictions for Rejecting Difficult Examples

consider a standard OOD Detection task

consider the rank correlation between the predicted uncertainty and the observed generalization error
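Concretely, this evaluation amounts to a Spearman rank correlation between per-example predicted uncertainties and observed losses; a minimal sketch with made-up numbers:

```python
import numpy as np
from scipy.stats import spearmanr

# hypothetical per-example quantities on a held-out / shifted test set
predicted_uncertainty = np.array([0.1, 0.8, 0.3, 0.9, 0.2])
observed_loss = np.array([0.05, 0.7, 0.4, 1.1, 0.1])

rho, pval = spearmanr(predicted_uncertainty, observed_loss)
print(f"Spearman rank correlation: {rho:.3f} (p = {pval:.3f})")
```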


6. Conclusion and Future Work

whereas standard measures of epistemic uncertainty focus on variance (due to approximation error), the paper argues that bias (introduced by misspecification) should be accounted for as part of the epistemic uncertainty, as it is reducible for predictors like neural networks whose effective capacity is a function of the training data

