Reconstruct Gaussian DAG
This note is based on Yuan, Y., Shen, X., Pan, W., & Wang, Z. (2019). Constrained likelihood for reconstructing a directed acyclic Gaussian graph. Biometrika, 106(1), 109–125.
A directed acyclic graph (DAG) is widely used to describe directional pairwise relations. These relations are estimated by reconstructing the structure of a directed acyclic graph, which is challenging when the ordering of the graph's nodes is unknown.
Motivation
Existing methods such as the neighbourhood and search-and-score methods suffer from high estimation errors or high computational complexity, especially when a local or sequential approach is used to enumerate edge directions by testing or optimizing a criterion locally; such a local method may break down even for moderately sized graphs.
Proposal
- A novel approach that simultaneously identifies all estimable directed edges and model parameters, using constrained maximum likelihood with nonconvex constraints.
- A constraint reduction method that constructs a set of active constraints out of the super-exponentially many acyclicity constraints.
- Coupled with the alternating direction method of multipliers and a difference-of-convex method, this permits efficient computation for large-graph learning.
- The proposed method consistently reconstructs the identifiable directions of the true graph and achieves optimal performance in parameter estimation.
- Numerically, the method compares favourably with its competitors. A protein network is analysed to demonstrate that the proposed method can make a difference in identifying the network's structure.
Literature
Two types of existing methods:
The first type applies local conditional independence tests sequentially to check the existence of directed edges; such methods usually have worst-case complexity that is exponential in the number of nodes.
- PC algorithm: $O(p^q)$ with $p$ nodes and maximal neighborhood size $q$.
- Search-and-score: optimize a goodness-of-fit measure for possible directions in a neighborhood of interest.
- $L_1$ penalization method for learning a directed acyclic graphical model with a known topological ordering of the nodes.
- $L_1$ penalization method for interventional data.
- A high-dimensional additive structural equation model based on order search and penalized regression.
The second type comprises exact learning methods:
- dynamic programming: guarantees a global minimizer but has exponential time complexity and is feasible only for small graphs.
- mixed integer linear programming: may reduce the complexity to a polynomial order when the maximal number of parent nodes is restricted to one or two in the search. Such a restriction limits the scope of application and may destroy the acyclicity requirement of a directed acyclic graph and hence the local Markov property.
The paper claims that an iterative difference-of-convex algorithm can be employed to seek a good local minimizer with fast convergence.
Theoretical challenges:
- the reconstruction error of a directed acyclic graph may grow super-exponentially in $p$
- little theory to guide practice for reconstructing a directed acyclic graph
Gaussian directed acyclic graph
The joint distribution of $(X_1,\ldots,X_p)$ is $\prod_{j=1}^p\pr(X_j\mid \pa_j)$. The factorization property is equivalent to the local Markov property of a directed acyclic graph.
Embed directional effects induced by directed edges into a structural equation model:
\[X_j = f_j(\pa_j,Z_j)\]where $Z_j$ is the latent error representing unexplained variation in node $j$. Consider a Gaussian structural equation model in which each $f_j(\cdot,\cdot)$ becomes linear in $(\pa_j,Z_j)$ and $Z_j$ follows an independent $N(0,\sigma_j^2)$ distribution,
\[X_j=\sum_{k:k\neq j}A_{jk}X_k+Z_j\,\qquad Z_j\sim N(0,\sigma_j^2)\,,\]where $A$ is the parametrized adjacency matrix in which a nonzero $(j,k)$ element $A_{jk}$ corresponds to a directed edge from parent node $k$ to child node $j$, with its value $A_{jk}$ indicating the strength of the relation. Our goal is to estimate $A$ subject to the requirement that $A$ defines a directed acyclic graph.
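As a quick illustration (my own sketch, not from the paper), data from this model can be simulated directly once $A$ corresponds to a directed acyclic graph, since each observation solves $x=(I-A)^{-1}z$; the edge weights and sample size below are arbitrary choices.

```python
import numpy as np

# Minimal sketch: simulate n observations from the linear Gaussian structural
# equation model X_j = sum_{k != j} A_jk X_k + Z_j for a hand-picked DAG.
rng = np.random.default_rng(0)

p, n, sigma = 4, 500, 1.0
A = np.zeros((p, p))
A[1, 0] = 0.8    # directed edge from node 1 (parent) to node 2 (child)
A[2, 1] = -0.5   # node 2 -> node 3
A[3, 0] = 0.6    # node 1 -> node 4
# A is strictly lower triangular, so the graph it encodes is acyclic.

Z = rng.normal(scale=sigma, size=(n, p))   # independent latent errors
X = Z @ np.linalg.inv(np.eye(p) - A).T     # each row solves x = A x + z
```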
Directed acyclic graph models have been under-studied compared with undirected graphs. The major challenge stems from acyclicity and the corresponding nonconvex parameter space, which is defined by nonconvex constraints enforcing acyclicity of the graph. In contrast, the parameter space for Gaussian undirected graphical models is the positive-semidefinite cone, which is convex.
Denote a directed cycle by $(j_1,\ldots,j_L)$, a sequence of node indices in which $L$ is the length of the cycle and $j_1$ is required to be the smallest index. A directed acyclic graph contains no directed cycle. Since a directed cycle of length $L$ uses exactly $L$ edges among its $L$ nodes, acyclicity can be enforced by requiring that, for every possible directed cycle, the number of directed edges present is smaller than the number of nodes involved. For example, to prevent the directed cycle $(1,2,3)$, we introduce the constraint
\[I(A_{21}\neq 0) + I(A_{32}\neq 0) + I(A_{13}\neq 0)\le 2\,.\]On this basis, we introduce constraints on the entries of $A$ to enforce acyclicity for cycles of any length:
\[\sum_{k=1}^{L}I(A_{j_{k+1}j_k}\neq 0)\le L-1\,,\qquad j_{L+1}=j_1\,,\]for every index sequence in\[\{(j_1,\ldots,j_L):j_1=\underset{1\le k\le L}{\min}j_k,\ j_1\neq\cdots\neq j_L,\ L=2,\ldots,p\}\,.\]The total number of constraints is
\[\binom{p}{2}1! + \binom{p}{3}2!+\cdots + \binom{p}{p}(p-1)!=O(p^p)\,.\]Theorem 1 of the paper implies that these $O(p^p)$ constraints can be reduced to $p^3-p^2$ constraints.
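As a sanity check on these counts (my own sketch, not from the paper), one can enumerate the cycle constraints for small $p$ and compare the result with the closed form above and with the reduced count $p^3-p^2$:

```python
from itertools import combinations, permutations
from math import comb, factorial

def n_cycle_constraints(p):
    """Count sequences (j_1, ..., j_L), L >= 2, of distinct node indices with
    j_1 the smallest; each such sequence contributes one acyclicity constraint."""
    count = 0
    for L in range(2, p + 1):
        for nodes in combinations(range(1, p + 1), L):   # sorted, so nodes[0] = j_1
            count += sum(1 for _ in permutations(nodes[1:]))
    return count

for p in range(2, 7):
    closed_form = sum(comb(p, L) * factorial(L - 1) for L in range(2, p + 1))
    print(p, n_cycle_constraints(p), closed_form, p**3 - p**2)
```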
Constrained Maximum Likelihood
Suppose that the $\sigma_j$ $(j=1,\ldots,p)$ are all equal. Then we have
\[X_j=\sum_{k:k\neq j}A_{jk}X_k+Z_j\]where all the $Z_j$ are independent and identically distributed according to $N(0,\sigma^2)$ with $\sigma > 0$.
Given an $n\times p$ data matrix $(x_{ij})_{n\times p}$, the negative log-likelihood, after dropping a constant term, is
\[l(A) = \frac 12\sum_{j=1}^p\sum_{i=1}^n\Big(x_{ij}-\sum_{k:k\neq j}x_{ik}A_{jk}\Big)^2\,,\]which is convex in $A$.
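With the diagonal of $A$ fixed at zero, the loss can be written compactly as $l(A)=\tfrac12\lVert X-XA^{\top}\rVert_F^2$. A small sketch of this computation and its gradient (my own, assuming NumPy and the simulated `X` above):

```python
def neg_loglik(A, X):
    """l(A) = 0.5 * sum_{i,j} (x_ij - sum_{k != j} x_ik A_jk)^2, with diag(A) = 0."""
    R = X - X @ A.T            # residuals: R[i, j] = x_ij - sum_k x_ik A_jk
    return 0.5 * np.sum(R ** 2)

def neg_loglik_grad(A, X):
    """Gradient of l(A); diagonal entries are zeroed since A_jj is not a parameter."""
    G = -(X - X @ A.T).T @ X
    np.fill_diagonal(G, 0.0)
    return G
```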
In addition to the acyclicity constraints, a constraint is imposed to regularize the sparsity of $A$. The directed acyclic graph learning problem can then be formulated as
\[\begin{align*} \underset{A}{\min}\ & l(A)\\ \text{subject to}\ & \sum_{i,j:i\neq j} I(A_{ij}\neq 0) \le K\,,\\ & \sum_{k=1}^{L}I(A_{j_{k+1}j_k}\neq 0)\le L-1\,,\qquad j_{L+1}=j_1\,,\\ &\text{for all }(j_1,\ldots,j_L)\text{ with } j_1= \underset{1\le k\le L}{\min}j_k,\ j_1\neq\cdots\neq j_L,\ L=2,\ldots,p\,. \end{align*}\]Applying Theorem 1 and approximating the indicator functions by a piecewise-linear function $J_\tau$ to circumvent their discontinuity, the problem can be reduced further and solved by difference-of-convex programming:
\[\begin{align*} \underset{(A,\Lambda)}{\min}\ & l(A)\\ \text{subject to}\ & \sum_{i,j:i\neq j}J_\tau (A_{ij})\le K\,,\\ &\lambda_{jk} + I(j\neq k) - \lambda_{jl}\ge J_\tau(A_{lk})\,,\quad \lambda_{jk}\ge 0\,,\quad 1\le j,k,l\le p,\ k\neq l\,, \end{align*}\]where $\Lambda=(\lambda_{jk})_{p\times p}$. In practice, $\tau$ and $K$ are treated as tuning parameters and optimized over a grid on their domain, using an independent validation set or cross-validation.
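The exact form of $J_\tau$ is not reproduced in this note; assuming the truncated-$\ell_1$ surrogate $J_\tau(z)=\min(|z|/\tau,1)$ and the reduced constraints as written above, a brute-force feasibility check for a candidate pair $(A,\Lambda)$ could look like the following sketch (function and variable names are mine, not the paper's):

```python
def J_tau(z, tau):
    """Assumed piecewise-linear surrogate for the indicator I(z != 0)."""
    return np.minimum(np.abs(z) / tau, 1.0)

def is_feasible(A, Lam, tau, K, tol=1e-8):
    """Check the relaxed sparsity constraint and the reduced acyclicity constraints."""
    p = A.shape[0]
    off_diag = ~np.eye(p, dtype=bool)
    if J_tau(A[off_diag], tau).sum() > K + tol:          # sparsity constraint
        return False
    if np.any(Lam < -tol):                               # lambda_jk >= 0
        return False
    for j in range(p):                                   # reduced acyclicity constraints
        for k in range(p):
            for l in range(p):
                if k != l and Lam[j, k] + (j != k) - Lam[j, l] < J_tau(A[l, k], tau) - tol:
                    return False
    return True
```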
The authors solve this constrained minimization through difference-of-convex programming combined with the alternating direction method of multipliers.
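For intuition on where the difference-of-convex structure comes from (again assuming the truncated-$\ell_1$ form of $J_\tau$ above), note the decomposition
\[J_\tau(z)=\min\Big(\frac{|z|}{\tau},\,1\Big)=\frac{|z|}{\tau}-\max\Big(\frac{|z|}{\tau}-1,\,0\Big)\,,\]a difference of two convex functions. At each iteration, the subtracted convex term can be replaced by its affine minorization at the current iterate, so every constraint becomes convex and the resulting convex subproblem can be handled by the alternating direction method of multipliers.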