Post-selection inference via algorithmic stability
This note is about the paper: Zrnic, T., & Jordan, M. I. (2023). Post-selection inference via algorithmic stability. The Annals of Statistics, 51(4), 1666–1691.
The paper proposes a solution to the problem of inference after selection by building on the framework of algorithmic stability, in particular the branch with origins in the field of differential privacy.
Stability is achieved via randomization of the selection, and it serves as a quantitative measure that is sufficient to obtain nontrivial post-selection corrections for classical confidence intervals.
The theoretical framework delivers intervals of tunable width, since the confidence intervals derive from a quantitative measure of the algorithmic stability of the selection procedure.
- $\hat S$: a data-dependent outcome of a selection
- $\beta_S$: the resulting inferential target
- $\hat S_0$: an oracle's guess of $\hat S$; imagine an oracle that knows the method used to arrive at the selection and the distribution of the data, but not its realization.
A selection procedure is $\eta$-stable for some $\eta > 0$ if there exists an oracle such that, with high probability over the distribution of the data, the likelihood of any selection under $\hat S$ and the likelihood of the same selection under $\hat S_0$ differ by at most a multiplicative factor of $e^\eta$.
The magnitude of stability depends not only on the selection method, but also on the distribution of the data.
For every fixed selection $S$, suppose that $\text{CI}_S^{(\alpha)}$ are confidence intervals with valid coverage: \(\bbP\{\beta_S\not\in \text{CI}_S^{(\alpha)}\}\le \alpha.\) Let $\hat S$ be an $\eta$-stable selection. Then \(\bbP\{\beta_{\hat S} \not\in \text{CI}_{\hat S}^{(\alpha e^{-\eta})}\} \le \alpha.\)
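For instance, with $\alpha = 0.05$ and $\eta = 0.1$, the corrected level is $\alpha e^{-\eta} \approx 0.0452$, so the two-sided Gaussian quantile only grows from $z_{0.975} \approx 1.96$ to $z_{1 - 0.0452/2} \approx 2.00$.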
2. Motivating vignettes
2.1 Vignette 1: Winner’s curse
Consider the problem of selecting the largest observed effect. Suppose one observes an $n$-dimensional vector $y\sim N(\mu, \sigma^2I)$; for example, each entry in this vector could be an observed treatment effect for a separate treatment.
We are interested in doing inference on the most significant effect.
More formally, denoting $i_\star = \argmax_i y_i$, we want to construct a confidence interval for $\mu_{i_\star}$.
This is a random inferential target because $i_\star$ is a function of the data.
One simple way to provide valid inference for $\mu_{i_\star}$ is to apply the Bonferroni correction:
\[\bbP\{\mu_{i_\star} \in (y_{i_\star} \pm z_{1-\alpha/(2n)}\sigma) \} \ge 1 - \alpha\]
Benjamini et al. show that a tighter correction is valid:
\[\bbP\{\mu_{i_\star} \in (y_{i_\star} \pm z_{1-\alpha/(n+1)}\sigma)\} \ge 1 - \alpha\]
The paper shows that, if we randomize the selection step rather than select $i_\star$ exactly, the intervals can be made even tighter.
Suppose that we select $\hat i_\star = \argmax_i (y_i + \xi_i)$, where $\xi_i \sim_{iid} \text{Lap}\left(\frac{2z_{1-\alpha\delta/(2n)}}{\eta}\right)$, for user-chosen parameters $\eta > 0$ and $\delta \in (0, 1)$. Then \(\bbP\{\mu_{\hat i_\star} \in (y_{\hat i_\star} \pm z_{1-\alpha(1-\delta)e^{-\eta}/2}\sigma)\} \ge 1 - \alpha.\)
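A minimal simulation sketch of this proposition, taking $\sigma = 1$ so the Laplace scale matches the display verbatim; the helper name and all parameter values are illustrative:

```python
import numpy as np
from scipy.stats import norm

# Simulation of the randomized winner selection from Vignette 1.
# All parameter choices (n, mu, sigma, alpha, eta, delta) are illustrative.
rng = np.random.default_rng(0)
n, sigma, alpha, eta, delta = 50, 1.0, 0.05, 1.0, 0.2
mu = np.zeros(n)  # true effects; the coverage guarantee holds for any mu

def randomized_winner_ci(y):
    # Laplace scale from the proposition: 2 * z_{1 - alpha*delta/(2n)} / eta
    b = 2 * norm.ppf(1 - alpha * delta / (2 * n)) / eta
    i_hat = int(np.argmax(y + rng.laplace(scale=b, size=n)))
    # Corrected quantile: z_{1 - alpha*(1-delta)*e^{-eta}/2} instead of z_{1-alpha/2}
    z = norm.ppf(1 - alpha * (1 - delta) * np.exp(-eta) / 2)
    return i_hat, (y[i_hat] - z * sigma, y[i_hat] + z * sigma)

# Monte Carlo check of coverage for the selected effect mu_{i_hat}
cover, trials = 0, 2000
for _ in range(trials):
    y = rng.normal(mu, sigma)
    i_hat, (lo, hi) = randomized_winner_ci(y)
    cover += lo <= mu[i_hat] <= hi
print(f"empirical coverage: {cover / trials:.3f}  (target >= {1 - alpha})")
```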
2.2 Vignette 2: Feature Selection
Suppose the goal of selection is simply to maximize the absolute correlation of the selected feature with $y$:
\[i_\star = \argmax_i \vert X_i^Ty\vert\]
Then the results imply the following.
Suppose that we select $\hat i_\star = \argmax_i \vert X_i^Ty + \xi_i\vert$, where $\xi_i \sim_{iid} \text{Lap}\left(\frac{2z_{1-\alpha\delta/(2d)}}{\eta}\right)$, for user-chosen parameters $\eta > 0$ and $\delta \in (0, 1)$. Then \(\bbP\{X_{\hat i_\star}^T\mu\in (X_{\hat i_\star}^Ty \pm z_{1-\alpha(1-\delta)e^{-\eta}/2}\sigma)\} \ge 1 - \alpha.\)
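The same sketch adapts to feature selection; here the columns of $X$ are normalized to unit norm, an assumption I add so that $X_{\hat i_\star}^Ty$ has standard deviation $\sigma$:

```python
import numpy as np
from scipy.stats import norm

# Sketch of the randomized feature selection from Vignette 2, with the Laplace
# scale taken verbatim from the display above (sigma = 1). Illustrative values.
rng = np.random.default_rng(1)
n, d, sigma, alpha, eta, delta = 200, 30, 1.0, 0.05, 1.0, 0.2
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=0)          # unit-norm features (an assumption)
y = rng.normal(np.zeros(n), sigma)

b = 2 * norm.ppf(1 - alpha * delta / (2 * d)) / eta   # Laplace scale
scores = np.abs(X.T @ y + rng.laplace(scale=b, size=d))
i_hat = int(np.argmax(scores))

z = norm.ppf(1 - alpha * (1 - delta) * np.exp(-eta) / 2)
center = X[:, i_hat] @ y
print(f"CI for X_i^T mu: ({center - z * sigma:.3f}, {center + z * sigma:.3f})")
```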
3. Related work
3.1 Simultaneous coverage
In the formulation of post-selection inference by Berk et al., the goal is to construct simultaneous confidence intervals that are valid for any model selection method $\hat M: y\rightarrow \cM$ mapping the data into a prespecified model class $\cM$.
The stability-based approach, in contrast, does not rely on any independence assumption between different observations.
The intervals computed by simultaneous methods and related approaches are unnecessarily wide, as they do not exploit any knowledge of how the analyst arrives at the selected target.
3.2 Conditional coverage
Conditional methods exploit properties of the selection procedure. In particular, the goal of conditional post-selection inference is to design $\text{CI}_S$ such that, for all fixed selections $S$,
\[\bbP\{\beta_S\in \text{CI}_S\mid \hat S = S\} \ge 1 -\alpha\]
However, the conditional approach can lead to overconditioning and thus to wide intervals. Informally, overconditioning refers to the phenomenon of overstating the cost of selection, leaving little information for inference.
3.3 Algorithmic stability
A more recent line of work on adaptive data analysis has recognized that a stability concept can be extracted from differential privacy and exploited to obtain perturbation-based generalization guarantees in learning theory.
4. Algorithmic stability and selection
A random variable $Q$ is $(\eta, \tau)$-indistinguishable from $W$, denoted $Q\approx_{\eta, \tau}W$, if for all measurable sets $\cO$ \(\bbP\{Q\in \cO\} \le e^\eta \bbP\{W\in \cO\} + \tau\)
Let $\cA: \IR^n\rightarrow \cS$ be a randomized algorithm. We say that $\cA$ is $(\eta, \tau, \nu)$-stable with respect to a distribution $\cP$ supported on $\IR^n$ if there exists a random variable $A_0$, possibly dependent on $\cP$, such that \(\cP\{w\in \IR^n: \cA(w)\approx_{\eta, \tau}A_0\}\ge 1 - \nu\)
Example
Suppose we want to compute $w^Ty$ for some fixed vector $w$, and suppose $\cP_y = N(\mu, \sigma^2 I)$ with known $\sigma > 0$. Let $\cA(y) = w^Ty + \xi$, where $\xi \sim \text{Lap}\left(\frac{z_{1-\nu/2}\sigma\Vert w\Vert_2}{\eta}\right)$.
This mechanism is $(\eta, 0, \nu)$-stable.
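A numerical illustration of this example, taking the oracle to be $A_0 = w^T\mu + \xi$ (the oracle knows $\cP_y$, hence $\mu$, but not $y$); the grid, helper name, and all values are my own:

```python
import numpy as np
from scipy.stats import norm

# Checking the (eta, 0, nu)-stability of the Laplace mechanism A(y) = w^T y + xi.
eta, nu, sigma = 0.5, 0.05, 1.0
w = np.array([0.6, -0.8])               # fixed query vector, ||w||_2 = 1
mu = np.array([1.0, 2.0])
b = norm.ppf(1 - nu / 2) * sigma * np.linalg.norm(w) / eta   # Laplace scale

def lap_logpdf(x, loc, scale):
    return -np.log(2 * scale) - np.abs(x - loc) / scale

# With prob. >= 1 - nu, |w^T y - w^T mu| <= z_{1-nu/2} * sigma * ||w||_2 =: Delta,
# and shifting a Lap(b) density by Delta moves its log-density by at most
# Delta / b = eta, which is exactly (eta, 0)-indistinguishability from A_0.
rng = np.random.default_rng(2)
y = rng.normal(mu, sigma)
grid = np.linspace(-5, 10, 1000)
ratio = lap_logpdf(grid, w @ y, b) - lap_logpdf(grid, w @ mu, b)
print(f"max |log density ratio| on grid: {np.abs(ratio).max():.3f}  (eta = {eta})")
```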
5. Confidence intervals after stable selection
Suppose that, under selection $S$, the target of inference is $\beta_S$. Moreover, suppose that $\text{CI}_S^{(\alpha)}$ are valid intervals at level $1-\alpha$ for any fixed $S$, meaning \(\bbP\{\beta_S\not\in \text{CI}_S^{(\alpha)}\} \le \alpha.\)
Fix $\delta\in (0, 1)$, and let $\hat S$ be an $(\eta, \tau, \nu)$-stable selection algorithm. Then \(\bbP\{\beta_{\hat S}\not \in \text{CI}_{\hat S}^{(\delta e^{-\eta})}\}\le \delta + \tau + \nu.\)
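A hypothetical helper that turns the theorem into a recipe: to target overall miscoverage $\alpha$, set $\delta = \alpha - \tau - \nu$ and build the classical Gaussian interval at nominal level $\delta e^{-\eta}$:

```python
import numpy as np
from scipy.stats import norm

def corrected_quantile(alpha, eta, tau, nu):
    # delta + tau + nu <= alpha requires delta = alpha - tau - nu
    delta = alpha - tau - nu
    if delta <= 0:
        raise ValueError("alpha must exceed tau + nu")
    level = delta * np.exp(-eta)          # nominal level for CI_S^{(delta e^-eta)}
    return norm.ppf(1 - level / 2)        # two-sided Gaussian quantile

# Example: compare to the uncorrected z_{1-alpha/2} = 1.96
print(corrected_quantile(alpha=0.05, eta=0.5, tau=0.0, nu=0.01))  # ~2.25
```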
5.1 Comparison with data splitting
Suppose we allocate an $f$-fraction of the data to selection and the remaining $(1-f)$-fraction to inference. Then the resulting intervals will roughly look like classical intervals inflated by a factor of $\sqrt{\frac{1}{1-f}}$, regardless of how the selection is performed.
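A rough width comparison under illustrative parameter values (my own calculation, not one from the paper):

```python
import numpy as np
from scipy.stats import norm

# Data splitting inflates the classical z_{1-alpha/2} quantile by sqrt(1/(1-f)),
# while the stability correction replaces the quantile as in Vignette 1.
alpha, eta, delta, f = 0.05, 1.0, 0.2, 0.5
classical = norm.ppf(1 - alpha / 2)
splitting = classical * np.sqrt(1 / (1 - f))
stable = norm.ppf(1 - alpha * (1 - delta) * np.exp(-eta) / 2)
print(f"classical {classical:.2f} | splitting {splitting:.2f} | stable {stable:.2f}")
# -> classical 1.96 | splitting 2.77 | stable 2.44
```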
Suppose for concreteness that $y\sim N(\mu, I)$ and that we are considering doing inference on one of two targets, $\nu_0^T\mu$ or $\nu_1^T\mu$, where the selection $\hat S\in \{0, 1\}$ depends on the data $y$.
Another proposal that is conceptually closely related to data splitting is the (U, V) decomposition.
6. Model selection in linear regression
Here, we do not assume that $\mu = \bbE[y]$ can be expressed as a linear combination of $\{X_i\}_{i=1}^d$.
The goal is to construct simultaneous CIs for the target of inference $\beta_{\hat M}$.
The intervals resulting from this approach are of the form
\[\text{CI}_{j\hat M}(K):= (\hat \beta_{j\hat M} \pm K\hat \sigma_{j\hat M})\]
We refer to the minimal such valid $K$ as the PoSI constant.
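A brute-force Monte Carlo sketch of this constant for a small design, assuming known $\sigma = 1$ and enumerating all submodels; illustrative, not the paper's algorithm:

```python
import numpy as np
from itertools import combinations

# K is (an estimate of) the 1-alpha quantile of the maximal |t-ratio| over all
# submodels M and coordinates j in M. The pivots do not involve mu, so we may
# simulate with mu = 0; K depends on the design X only.
def posi_constant(X, alpha=0.05, trials=2000, rng=np.random.default_rng(3)):
    n, d = X.shape
    models = [list(c) for r in range(1, d + 1) for c in combinations(range(d), r)]
    maxima = np.empty(trials)
    for t in range(trials):
        eps = rng.normal(size=n)
        stats = []
        for M in models:
            XM = X[:, M]
            beta = np.linalg.lstsq(XM, eps, rcond=None)[0]   # beta_hat - beta
            se = np.sqrt(np.diag(np.linalg.inv(XM.T @ XM)))  # sigma = 1
            stats.append(np.max(np.abs(beta / se)))
        maxima[t] = max(stats)
    return np.quantile(maxima, 1 - alpha)

X = np.random.default_rng(4).normal(size=(100, 4))
print(f"estimated PoSI constant: {posi_constant(X):.2f}")
```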
7. Conditional coverage
7.1 Example: Publication bias
Suppose we observe an effect $y\sim \cP_y$ with $\bbE[y]=\mu$ and $\text{supp}(\cP_y)\subset \IR$. We are interested in constructing an interval for $\mu$ only if the observed effect is deemed interesting enough; for example, if $y > T$ for some threshold $T$. Denote by $\text{report}(y)$ the event that we decide to report the confidence interval.
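A quick simulation of why this conditioning matters: given the report event, the naive interval $y \pm z_{1-\alpha/2}$ badly undercovers $\mu$ (all values illustrative):

```python
import numpy as np
from scipy.stats import norm

# Publication bias: only effects with y > T are reported, so coverage must be
# assessed conditionally on report(y) = {y > T}.
rng = np.random.default_rng(5)
mu, T, alpha, trials = 0.0, 1.5, 0.05, 200_000
z = norm.ppf(1 - alpha / 2)
y = rng.normal(mu, 1.0, size=trials)
reported = y > T                          # only "interesting" effects reported
covered = np.abs(y - mu) <= z
print(f"unconditional coverage:   {covered.mean():.3f}")   # ~0.95
print(f"coverage given report(y): {covered[reported].mean():.3f}")  # ~0.63
```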
8. The design of stable selection algorithms
8.1 Properties of stability
- closure under post-processing: applying a data-independent function to a stable selection preserves its stability
- composition: running multiple stable selections together is itself stable, with accumulated parameters (see the sketch after this list)
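A hypothetical helper illustrating composition, assuming the basic DP-style rule in which the three parameters simply add up (the paper may prove a tighter accounting):

```python
# Compose the stability parameters of several selections run on the same data.
def compose(*params):
    # each element of params is a tuple (eta, tau, nu)
    eta = sum(p[0] for p in params)
    tau = sum(p[1] for p in params)
    nu = sum(p[2] for p in params)
    return eta, tau, nu

# Two stable selection steps run in sequence:
print(compose((0.5, 0.0, 0.01), (0.3, 0.001, 0.01)))  # -> (0.8, 0.001, 0.02)
```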
8.2 Model selection algorithms
8.2.1 Model selection via the Lasso
8.2.2 Model selection via marginal screening
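In the spirit of Vignette 2, marginal screening can presumably be made stable by adding Laplace noise to each absolute correlation before taking the top $k$; a minimal sketch, with the noise scale and helper name purely illustrative (the paper's calibration may differ):

```python
import numpy as np

# Randomized marginal screening: perturb each score |X_i^T y| with independent
# Laplace noise, then keep the k largest noisy scores.
def noisy_marginal_screening(X, y, k, scale, rng=np.random.default_rng(6)):
    scores = np.abs(X.T @ y) + rng.laplace(scale=scale, size=X.shape[1])
    return np.argsort(scores)[-k:]        # indices of the k largest noisy scores

X = np.random.default_rng(7).normal(size=(200, 50))
y = X[:, :3] @ np.ones(3) + np.random.default_rng(8).normal(size=200)
print(sorted(noisy_marginal_screening(X, y, k=5, scale=2.0)))
```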
9. Experimental results
The experiments evaluate the selective intervals for the Lasso and marginal screening, and compare them with data splitting.