WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Reconstruct Gaussian DAG

Posted on
Tags: DAG, Nonconvex, Optimization

This note is based on Yuan, Y., Shen, X., Pan, W., & Wang, Z. (n.d.). Constrained likelihood for reconstructing a directed acyclic Gaussian graph. Biometrika.

DAG is widely used to describe directional pairwise relations. The relations are estimated by reconstructing a directed acyclic graph’s structure, which is challenging when the ordering of nodes of the graph is unknown.

Motivation

Existing methods such as the neighbourhood and search-and-score methods have high estimation errors or computational complexities, especially when a local or sequential approach is used to enumerate edge directions by testing or optimizing a criterion locally, as a local method may break down even for moderately sized graphs.

Proposal

A novel approach to simultaneously identifying all estimable directed edges and model parameters, using constrained maximum likelihood with nonconvex constraints.

Develop a constraint reduction method that constructs a set of active constraints from super-exponentially many constraints.

Coupled with an alternating direction method of multipliers and a difference convex method, permits efficient computation for large-graph learning.

The proposed method consistently reconstructs identifiable directions of the true graph and achieves the optimal performances in terms of parameter estimation.

Numerically, the method compares favourably with competitors. A protein network is analysed to demonstrate that the proposed method can make a difference in identifying the network’s structure.

Literatures

Two types

local conditional independence tests sequentially to check the existence of directed edges

usually have worst-case complexity that is exponential in the number of nodes

  • PC algorithm: $O(p^q)$ with $p$ nodes and maximal neighborhood size $q$.
  • Search-and-score: optimize a goodness-of-fit measure for possible directions in a neighborhood of interest.
  • $L_1$ penalization method for learning a directed acyclic graphical model with a known topological ordering of the nodes.
  • $L_1$ penalization method for interventional data
  • a high-dimensional additive structural equation model based on order search and penalized regression.

exact learning method

  • dynamic programming: guarantee a global minimizer but has exponential time complexity and is feasible only for small graphs.
  • mixed integer linear programming: may reduce the complexity to a polynomial order when the maximal number of parent nodes is restricted to one or two in the search. such a restriction limits the scope of application and may destroy the acyclically requirement of a directed acyclic graph and hence the local Markov property.

the paper claims that they can employ an iterative difference convex algorithm to seek a good local minimizer with fast convergence.

theoretical challenges:

  1. the reconstruction error of a directed acyclic graph may grow super-exponentially in $p$
  2. little theory to guide practice for reconstructing a directed acyclic graph

Gaussian directed acyclic graph

The joint distribution of $(X_1,\ldots,X_p)$ is $\prod_{j=1}^p\pr(X_j\mid \pa_j)$. The factorization property is equivalent to the local Markov property of a directed acyclic graph.

Embed directional effects induced by directed edges into a structural equation model:

\[X_j = f_j(\pa_j,Z_j)\]

where $Z_j$ is the latent error representing unexplained variation in node $j$. Consider a Gaussian structural equation model in which each $f_j(\cdot,\cdot)$ becomes linear in $(\pa_j,Z_j)$ and $Z_j$ follows an independent $N(0,\sigma_j^2)$ distribution,

\[X_j=\sum_{k:k\neq j}A_{jk}X_k+Z_j\,\qquad Z_j\sim N(0,\sigma_j^2)\,,\]

where $A$ is the parametrized adjacency matrix in which a nonzero $(j,k)$ element $A_{jk}$ corresponds to a directed edge from parent node $k$ to child node $j$, with its value $A_{jk}$ indicating the strength of the relation. Our goal is to estimate $A$ subject to the requirement that $A$ defines a directed acyclic graph.

DAG models have been under-studies compared with undirected graphs. The major challenge stems from acyclically and the corresponding nonconvex parameter space, defined by nonconvex constraints reinforcing acyclically of a graph. In contrast, the parameter space for Gaussian UG models is a positive-semidefinite cone, which a convex.

Denote a directed cycle by $(j_1,\ldots,j_L)$, a sequence of indices of nodes, where $L$ is the length of the directed cycle, and $j_1$ is required to be the smallest index. A DAG doesn’t contain a directed cycle. This implies that the number of directed edges is smaller than the number of nodes involved in every possible directed cycle. For example, to prevent a directed cycle, say $(1,2,3)$, we introduce the constraint

\[I(A_{21}\neq 0) + I(A_{32}\neq 0) + I(A_{13}\neq 0)\le 2\,.\]

On this basis, we introduce constraints on entries of $A$ to reinforce acyclically of any order.

\[\begin{align*} \sum_{j_{L+1}=j_1:1\le k\le L}I(A_{j_{k+1}j_k}\neq 0)\le L-1\\ \{(j_1,\ldots,j_L):j_1=\underset{1\le k\le L}{\min}j_k,j_1\neq\cdots\neq j_L,L=2,\ldots,p\} \end{align*}\]

The total number of constraints is

\[\binom{p}{2} + \binom{p}{2}2!+\cdots + \binom{p}{p}(p-1)!=O(p^p)\,.\]

Theorem 1 implies that the $O(p^p)$ constraints can be reduced to $p^3-p^2$ constraints.

Constrained Maximum Likelihood

Suppose that $\sigma_j(j=1,\ldots,p)$ are equal, we have

\[X_j=\sum_{k:k\neq j}A_{jk}X_k+Z_j\]

where all the $Z_j$ are independent and identically distributed according to $N(0,\sigma^2)$ with $\sigma > 0$.

Given an $n\times p$ data matrix ${x_{ij}}_{n\times p}$, the negative log-likelihood, after dropping a constant term, is

\[l(A) = \frac 12\sum_{j=1}^p\sum_{i=1}^n\Big(x_{ij}-\sum_{k:k\neq j}x_{ik}A_{jk}\Big)^2\,,\]

which is convex in $A$.

In addition to the constraints on acyclicity, also impose a constraint to regularize its sparsity. Then the directed acyclic graph learning problem can be formulated as

\[\begin{align*} \underset{A}{\min} & l(A)\\ \text{subject to} & \sum_{i,j:i\neq j} I(A_{ij}\neq 0) \le K\\ & \sum_{1\le k\le L:j_{L+1}=j_1}I(A_{j_{k+1}j_k}\neq 0)\le L-1\\ &\{(j_1,\ldots,j_L):j_1= \underset{1\le k\le L}{\min}j_k,j_1\neq\cdot\neq j_L, L=2,\ldots,p\} \end{align*}\]

Applying their theorem and approximating the indicator functions by piecewise-linear functions to circumvent the discontinuity, the above problem can be reduced further, and they can use difference convex programming.

\[\begin{align*} \underset{(A,\lambda)}{\min} & l(A)\\ \text{subject to} & \sum_{i,j:i\neq j}J_\tau (A_{ij})\le K\\ &\lambda_{jk} + I(j\neq k) - \lambda_{jk}\ge \lambda_\tau(A_{ij}) \end{align*}\]

In practice, $\tau$ and $K$ are treated as tuning parameters, optimized using an independent validation set or by cross validation over a set of grids on their domain.

They solve this constrained minimization through difference convex programming and the alternating direction method of multipliers.


Published in categories Note