WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Synthetic Instrument for Sparse Causation

Tags: Multivariate Analysis, Causal Inference, Unmeasured Confounding

This note is for Tang, D., Kong, D., & Wang, L. (2024). The synthetic instrument: From sparse association to sparse causation (No. arXiv:2304.01098). arXiv.

the synthetic instrument: from sparse association to sparse causation

  • standard approaches for high-dimensional data, such as the lasso, assume that the associations between the exposures and the outcome are sparse
    • but these methods do not estimate causal effects in the presence of unmeasured confounding.
  • the paper considers an alternative approach that assumes the causal effects under consideration are sparse
    • with sparse causation, causal effects are identifiable even with unmeasured confounding
    • at the core of the proposal is a novel device called the synthetic instrument, which, in contrast to standard instrumental variables, can be constructed directly from the observed exposures
    • under the assumption of sparse causation, the problem of causal effect estimation can be formulated as an $\ell_0$-penalization problem

Introduction

  • Hastie et al. (2009): the “bet on sparsity” principle: “use a procedure that performs well in sparse problems, as no procedure performs well in dense problems”
  • the “bet on sparsity” principle does not restrict the type of sparse problems that can be considered
    • sparse causation
  • consider a linear structural model with a $p$-dimensional exposure $X$, an outcome $Y$, and a $q$-dimensional latent variable $U$
\[\begin{aligned} X &= \Lambda U + \varepsilon_x\\ Y &= X^T\beta + U^T\gamma + \varepsilon_y \end{aligned}\]
  • under this model, the spurious correlations due to unmeasured confounding, characterized by $\Cov(X)^{-1}\Lambda\gamma$, are typically dense
    • as a result, the association between $X$ and $Y$ is not sparse, even if the causal effect $\beta$ is sparse
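
A small numerical check of this point, as a sketch only. The dimensions, parameter values, and the assumptions $\Cov(U) = I_q$, $\Cov(\varepsilon_x) = I_p$ are all illustrative choices, not taken from the paper. Under them, the population OLS coefficient equals $\beta + \Cov(X)^{-1}\Lambda\gamma$, and the second term has no zero entries even though $\beta$ has a single nonzero entry.

```python
import numpy as np

# Hypothetical sizes and parameter values, chosen only for illustration.
p, q = 6, 1
rng = np.random.default_rng(0)

Lambda = rng.uniform(0.5, 1.5, size=(p, q))  # loadings of X on U (all nonzero)
gamma = np.array([1.0])                      # effect of U on Y
beta = np.zeros(p)
beta[0] = 2.0                                # sparse causal effect: one nonzero entry

# Assumptions: Cov(U) = I_q, Cov(eps_x) = I_p, eps_x uncorrelated with U.
cov_x = Lambda @ Lambda.T + np.eye(p)        # Cov(X) = Lambda Lambda^T + I_p
cov_xy = cov_x @ beta + Lambda @ gamma       # Cov(X, Y) = Cov(X) beta + Lambda gamma

# Spurious part of the association: Cov(X)^{-1} Lambda gamma.
confounding_bias = np.linalg.solve(cov_x, Lambda @ gamma)
# Population OLS coefficient of Y on X: Cov(X)^{-1} Cov(X, Y).
ols_limit = np.linalg.solve(cov_x, cov_xy)

print("nonzeros in beta:     ", np.count_nonzero(beta))
print("nonzeros in OLS limit:", np.count_nonzero(np.abs(ols_limit) > 1e-8))
```

A lasso fit of $Y$ on $X$ targets the dense vector `ols_limit` rather than `beta`, which is why sparse-association methods do not recover the causal effect here.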

identification and estimation of the causal parameter $\beta$ are non-trivial due to the presence of unmeasured confounding by $U$

their contributions in this paper are twofold

  • under an additional plurality condition, they establish that the parameter $\beta$ is identifiable iff $\Vert\beta\Vert_0 < p - q$
    • the assumption on the sparsity level is both necessary and sufficient
  • they develop a synthetic two-stage regularized regression approach for estimating $\beta$, with a first-stage OLS and a second-stage $\ell_0$-penalized regression

related work

  • multivariate hidden confounding
    • a spectral deconfounding method for estimating $\beta$ under a high-dimensional model
    • other work allows the outcome to be multivariate and aims to identify the projection of $\beta$ onto a related space rather than the causal parameter $\beta$ itself
  • the estimation problem of $\beta$ can be framed within the context of causal inference with unmeasured confounding
    • the most popular approach is the instrumental variable (IV) framework
    • proximal causal inference framework: use information from ancillary variables known as negative control exposures and outcomes to remove bias due to unmeasured confounding.
  • compared with these frameworks, their approach does not rely on the collection of additional ancillary variables. Instead, they rely on the availability of multiple exposures and the sparsity assumption for identification and estimation

a strand of literature seeks to identify the causal effects of multiple exposures

  • deconfounder method, which first obtains an estimate $\hat U$ of the unmeasured confounder and then adjusts for $\hat U$ using standard regression methods
    • it has been pointed out that under this setting, without further assumptions, the causal effect $\beta$ is not identifiable.
  • Miao et al. (2023) consider a similar setting, showing that the causal effect is identifiable if $\Vert\beta\Vert_0 \le (p-q)/2$. Their sparsity constraint is significantly stronger than the one here, especially when the number of exposures is large relative to the number of latent confounders
    • Miao et al. (2023) also develop a robust linear regression-based estimator for $\beta$

the results also connect to recent literature on multiply robust causal identification, as they show identification in the union of many causal models

  • this contrasts with the rich literature on multiply robust estimators under the same causal model and improved doubly robust estimators that are consistent under multiple working models for two components of the likelihood

Outline of the paper

  • Sec 2: introduce the setup and background
  • Sec 3: introduce the identification strategy using the synthetic instrument method
  • Sec 4: present the estimation procedure and provide theoretical justifications; also covers extensions to non-linear outcome models
  • Sec 5: Simulation studies
  • Sec 6: mouse obesity data
  • Sec 7: discussion

2. Framework, notation, and identifiability

2.1 The model

\[Y= X^T\beta + g(U) + \varepsilon_y\]

2.2 Identifiability of the causal effect $\beta$

suppose $p = 3, q = 1$

first, without additional assumptions, $\beta$ is generally not identifiable due to unmeasured confounding by $U$. We have

\[\Cov(X_j, Y) = ..., j = 1,2,3\]

since there are three equations but four unknown parameters $\beta_1, \beta_2, \beta_3, \gamma$, the causal parameters $\beta$ are not identifiable from these equations

one possible approach to identifying $\beta$ is to assume prior knowledge about certain elements of $\beta$. If it is assumed that $\beta_2 = 0$, then $\beta_1$, $\beta_3$, and $\vert\gamma\vert$ are identifiable.
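
The $p = 3$, $q = 1$ case can be checked numerically. The sketch below is not the paper's procedure. It assumes the linear model from the Introduction with $\mathrm{Var}(U) = 1$ and uncorrelated exposure noise, so that $\Cov(X_j, Y) = \sum_k \Cov(X_j, X_k)\beta_k + \Lambda_j\gamma$, and all parameter values are hypothetical. The loadings are recovered up to sign from the off-diagonal entries of $\Cov(X)$. With $\beta_2 = 0$ imposed, three equations remain for the three unknowns $\beta_1, \beta_3, \gamma$.

```python
import numpy as np

# Hypothetical true parameters (p = 3, q = 1); beta_2 = 0 is the assumed prior knowledge.
lam_true = np.array([-1.0, 0.8, 0.6])   # loadings of X on U
beta_true = np.array([1.5, 0.0, -0.7])
gamma_true = -2.0
noise_var = np.array([1.0, 0.5, 2.0])   # Var(eps_x), assumed diagonal; Var(U) = 1 assumed

# Population moments implied by the linear model.
cov_x = np.outer(lam_true, lam_true) + np.diag(noise_var)
cov_xy = cov_x @ beta_true + lam_true * gamma_true

# Step 1: recover the loadings up to sign from the off-diagonal entries of Cov(X),
# using Cov(X_j, X_k) = lam_j * lam_k for j != k. The sign of lam_1 is fixed as positive.
lam1 = np.sqrt(cov_x[0, 1] * cov_x[0, 2] / cov_x[1, 2])
lam = np.array([lam1, cov_x[0, 1] / lam1, cov_x[0, 2] / lam1])

# Step 2: with beta_2 = 0, solve Cov(X, Y) = Cov(X)[:, (1, 3)] (beta_1, beta_3)^T + lam * gamma.
A = np.column_stack([cov_x[:, 0], cov_x[:, 2], lam])
beta1_hat, beta3_hat, gamma_hat = np.linalg.solve(A, cov_xy)

print("beta_1, beta_3:", beta1_hat, beta3_hat)
print("|gamma|:", abs(gamma_hat))
```

Because the recovered loadings here are $-\Lambda$ rather than $\Lambda$, the solved $\gamma$ has the opposite sign to `gamma_true`, which illustrates why only $\vert\gamma\vert$ is identifiable.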

however, in practice, it is often difficult to know a priori which exposures have zero causal effects. The paper instead relies on a sparsity assumption.

2.3 Instrumental variable

\[\begin{aligned} Y &= \beta X + \gamma U + \pi Z + \varepsilon_y\\ X &= \alpha_z Z + \Lambda U + \varepsilon_x \end{aligned}\]

For $Z$ to be a valid instrumental variable, the following three assumptions must hold:

  • $\pi = 0$: exclusion restriction
  • $\alpha_z\neq 0$: instrumental relevance
  • $\Cov(U, Z) = 0$: unconfoundedness

under these assumptions, one can consistently estimate $\beta$ via a two-stage least squares estimator:

  1. obtain the predicted exposure $\hat\bbE(X\mid Z)$ by linearly regressing $X$ on $Z$
  2. then regress $Y$ on $\hat\bbE(X\mid Z)$ to obtain an estimate of $\beta$
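
The two steps above can be sketched on simulated data. This is a minimal illustration with a scalar exposure, a scalar instrument, and hypothetical parameter values (`beta`, `gamma`, `alpha_z`, `lam`), with $\pi = 0$ and $Z$ drawn independently of $U$ so that the three assumptions hold by construction. The helper `ols_slope` is defined here for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
beta, gamma, alpha_z, lam = 1.0, 2.0, 1.5, 1.0  # hypothetical values; pi = 0

Z = rng.normal(size=n)                           # instrument, independent of U
U = rng.normal(size=n)                           # unmeasured confounder
X = alpha_z * Z + lam * U + rng.normal(size=n)   # exposure equation
Y = beta * X + gamma * U + rng.normal(size=n)    # outcome equation (no direct Z effect)


def ols_slope(x, y):
    """Slope from a simple linear regression of y on x with an intercept."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)


# Stage 1: linearly regress X on Z and form the predicted exposure.
alpha_hat = ols_slope(Z, X)
X_hat = X.mean() + alpha_hat * (Z - Z.mean())

# Stage 2: regress Y on the predicted exposure.
beta_2sls = ols_slope(X_hat, Y)

# Naive regression of Y on X, for comparison; it is biased by U.
beta_ols = ols_slope(X, Y)

print("2SLS estimate:", beta_2sls)
print("OLS estimate: ", beta_ols)
```

In this design the naive OLS slope converges to $\beta + \gamma\Lambda/\mathrm{Var}(X) = 1 + 2/4.25 \approx 1.47$, while the two-stage estimate should be close to the true $\beta = 1$.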

3. Identifying causal effects via the synthetic instrument

3.1 A new identification approach via voting

3.2 The synthetic instrument

3.3 Voting with the synthetic instrument

3.4 Synthetic two-stage regularized regression

4. Estimation via the synthetic two-stage regularized regression

4.1 Estimation

4.2 Theoretical properties

4.3 Extension to nonlinear settings

5. Simulations

5.1 Simulation studies with a linear outcome model

5.2 Simulation studies with non-linear outcome models

6. Real data application

7. Discussion

