Gaussian DAGs on Network Data
This post is based on Li, H., & Zhou, Q. (2019). Gaussian DAGs on network data. arXiv:1905.10848 [cs, stat].
Graphical models based on DAGs have a variety of applications, including
- genetics
- causal inference
- machine learning
Extensive research has been devoted to recovering the structure of DAGs from observational and experimental data. Most structure learning algorithms can be categorized as
- score-based: maximize a scoring function, such as MDL, BIC, or Bayesian scores, with various search strategies:
  - order-based search
  - greedy search
  - coordinate descent
- constraint-based: perform conditional independence tests among variables to construct a skeleton and then orient some of the edges
- hybrid methods: combine the above two approaches
Formulation of the classical DAG learning problem
- $X=(X_1,\ldots,X_p)\in \IR^{n\times p}$: data matrix of $n$ i.i.d. observations on $p$ nodes.
- goal: learn the structure of the true DAG $G^*$
- focus: the Gaussian DAG model, which is equivalent to the following Gaussian linear structural equation model (SEM):
\begin{equation} x_i = B^Tx_i + e_i, \; e_i\sim N_p(0, \Omega), \; i\in [n]\label{eq:sem1} \end{equation} where $B = (\beta_{kj})_{p\times p}$ is the weighted adjacency matrix of $G^*$ with edge set $E=\{(k, j):\beta_{kj} \neq 0\}$ and $\Omega =\diag(\omega_1^2,\ldots, \omega_p^2)$.
To get a better understanding, we can write out the $j$-th component,
\[x_{ij}=(B^Tx_i)_j + e_{ij} = \sum_{k=1}^p(B^T)_{jk}x_{ik} + e_{ij} = \sum_{k=1}^p\beta_{kj}x_{ik} + e_{ij}\,,\]
that is, each variable is a linear combination of its parents plus an independent Gaussian noise.
- goal (re-expressed): given i.i.d. samples $\{x_i\}_{i=1}^n$, estimate $B$.
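To make the setup concrete, here is a minimal simulation sketch of \eqref{eq:sem1} (my own illustration, not code from the paper; the helper name and toy $B$ are hypothetical). Solving the SEM gives $x_i = (I - B^T)^{-1}e_i$, so in row-vector form the stacked data satisfy $X = E(I-B)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_gaussian_dag(n, B, omega2, rng):
    """Hypothetical helper: draw n i.i.d. rows from x_i = B^T x_i + e_i,
    e_i ~ N_p(0, diag(omega2)), by computing X = E (I - B)^{-1}."""
    p = B.shape[0]
    E = rng.normal(size=(n, p)) * np.sqrt(omega2)  # column j scaled by omega_j
    return E @ np.linalg.inv(np.eye(p) - B)

# toy strictly upper-triangular B (nodes in topological order): 1 -> 2 -> 3
B = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, -0.5],
              [0.0, 0.0, 0.0]])
X = simulate_gaussian_dag(200, B, omega2=np.ones(3), rng=rng)
```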
A model for network data
Independence among the observed samples $\{x_i\}_{i=1}^n$ is one of the main assumptions behind all the aforementioned DAG learning methods. To relax it, use an undirected graph $\cG$ with each node representing an observation $x_i$ and each edge representing the conditional dependence between two observations given the rest. Let $A(\cG)$ denote the adjacency matrix of $\cG$.
Generalize the SEM \eqref{eq:sem1} to
\begin{equation}\label{eq:sem2} X_j =\sum_{k\in \Pi_j}\beta_{kj}X_k + \varepsilon_j\,, \; \varepsilon_j =(\varepsilon_{1j},\ldots, \varepsilon_{nj})\sim N_n(0, \omega_j^2\Sigma) \end{equation}
where $\Sigma$ is positive definite with $\diag(\Sigma)=1$, $\Pi_j$ is the parent set of $X_j$, and the $\varepsilon_j$ are independent across $j$. (So the parent sets encode the DAG structure among the variables, while the dependence among observations is captured by the undirected graph model.)
When $\Sigma=I_n$, the SEM \eqref{eq:sem2} reduces to \eqref{eq:sem1}.
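As an illustration (again my own sketch, with hypothetical names), one can simulate from \eqref{eq:sem2} by drawing each noise column as $\varepsilon_j = \omega_j L z_j$ with $\Sigma = LL^T$ and $z_j\sim N_n(0, I_n)$, then solving $X = XB + E$ as before:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_network_dag(B, omega2, Sigma, rng):
    """Hypothetical helper: draw X column-wise from
    X_j = sum_{k in Pi_j} beta_{kj} X_k + eps_j, eps_j ~ N_n(0, omega_j^2 Sigma).
    Stacking the columns gives X = X B + E, hence X = E (I - B)^{-1}."""
    n, p = Sigma.shape[0], B.shape[0]
    L = np.linalg.cholesky(Sigma)
    E = (L @ rng.normal(size=(n, p))) * np.sqrt(omega2)  # rows correlated via Sigma
    return E @ np.linalg.inv(np.eye(p) - B)

# toy equicorrelated Sigma with unit diagonal, e.g. a densely connected network
n, rho = 50, 0.4
Sigma = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
B = np.array([[0.0, 0.8],
              [0.0, 0.0]])
X = simulate_network_dag(B, omega2=np.array([1.0, 0.5]), Sigma=Sigma, rng=rng)
```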
The distinction between \eqref{eq:sem1} and \eqref{eq:sem2} becomes clearer when we regard \eqref{eq:sem2} as a semi-Markovian causal model. Represent each variable $z_i$ in a DAG $G$ on vertices $\{z_1,\ldots,z_p\}$ by a deterministic function:
\[z_i = f_i(\Pi_i, u_i), \quad i\in [p]\,,\]
where $\Pi_i$ denotes the parents of node $z_i$ in $G$ and the $u_i$ are noise variables, sometimes also referred to as background variables.
- Markovian: the noise variables $u_i$ are jointly independent
- semi-Markovian: the noise variables are dependent
Under \eqref{eq:sem2}, the noise terms of different observations are correlated through $\Sigma$, so the background variables across samples are no longer jointly independent: the model is semi-Markovian.
Contributions
When the natural ordering of the DAG nodes is
- known (i.e., $X_1,\ldots,X_p$ is assumed to be a topological ordering of $G^*$): develop a maximum penalized likelihood algorithm based on block coordinate descent (BCD) to jointly estimate the DAG structure and the sample correlations
- unknown: show empirically that the algorithm can still estimate the sample correlation accurately, which can be used to improve existing DAG learning methods via data decorrelation (see the sketch after this list)
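The decorrelation step itself is simple. Here is a minimal sketch (my own illustration, assuming an estimate $\hat\Sigma$ of $\Sigma$ is available): with $\hat\Sigma = LL^T$, the whitened data $L^{-1}X$ satisfy the i.i.d.-row SEM \eqref{eq:sem1} with the same $B$, since $L^{-1}\varepsilon_j\sim N_n(0,\omega_j^2 I_n)$ when $\hat\Sigma=\Sigma$:

```python
import numpy as np
from scipy.linalg import solve_triangular

def decorrelate(X, Sigma_hat):
    """Hypothetical helper illustrating the decorrelation idea: whiten the
    rows of X using an estimate Sigma_hat of the row correlation matrix."""
    L = np.linalg.cholesky(Sigma_hat)
    # solve L @ X_tilde = X rather than forming L^{-1} explicitly
    return solve_triangular(L, X, lower=True)
```

Any of the score-based or constraint-based methods above can then be run on the whitened data as if the rows were i.i.d.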