Reconstruct Gaussian DAG
This note is based on Yuan, Y., Shen, X., Pan, W., & Wang, Z. (2019). Constrained likelihood for reconstructing a directed acyclic Gaussian graph. Biometrika, 106(1), 109–125.
DAGs are widely used to describe directional pairwise relations. These relations are estimated by reconstructing the structure of a directed acyclic graph, which is challenging when the ordering of the graph's nodes is unknown.
Motivation
Existing methods, such as neighbourhood and search-and-score methods, suffer from high estimation error or high computational complexity, especially when a local or sequential approach is used to enumerate edge directions by testing or optimizing a criterion locally; such a local method may break down even for moderately sized graphs.
Proposal
- A novel approach that simultaneously identifies all estimable directed edges and model parameters, using constrained maximum likelihood with nonconvex constraints.
- A constraint reduction method that constructs a set of active constraints out of the super-exponentially many original constraints.
- Coupled with the alternating direction method of multipliers and a difference convex method, this permits efficient computation for large-graph learning.
- The proposed method consistently reconstructs the identifiable directions of the true graph and achieves optimal performance in parameter estimation.
Numerically, the method compares favourably with competitors. A protein network is analysed to demonstrate that the proposed method can make a difference in identifying the network’s structure.
Literature
Two types of existing methods:
Test-based methods apply local conditional independence tests sequentially to check the existence of directed edges; they usually have worst-case complexity that is exponential in the number of nodes.
- PC algorithm: worst-case complexity $O(p^q)$ for $p$ nodes and maximal neighbourhood size $q$.
- Search-and-score methods: optimize a goodness-of-fit measure over possible directions in a neighbourhood of interest.
- L1 penalization method for learning a directed acyclic graphical model with a known topological ordering of the nodes.
- L1 penalization method for interventional data
- a high-dimensional additive structural equation model based on order search and penalized regression.
Exact learning methods:
- dynamic programming: guarantees a global minimizer but has exponential time complexity and is feasible only for small graphs.
- mixed integer linear programming: may reduce the complexity to a polynomial order when the maximal number of parent nodes is restricted to one or two in the search. Such a restriction limits the scope of application and may destroy the acyclicity requirement of a directed acyclic graph and hence the local Markov property.
The paper claims that an iterative difference convex algorithm can be employed to seek a good local minimizer with fast convergence.
theoretical challenges:
- the reconstruction error of a directed acyclic graph may grow super-exponentially in p
- little theory to guide practice for reconstructing a directed acyclic graph
Gaussian directed acyclic graph
The joint distribution of $(X_1, \dots, X_p)$ factorizes as $\prod_{j=1}^{p} \operatorname{pr}(X_j \mid \mathrm{pa}_j)$, where $\mathrm{pa}_j$ denotes the set of parents of node $j$. This factorization property is equivalent to the local Markov property of a directed acyclic graph.
Embed the directional effects induced by directed edges into a structural equation model:
$$X_j = f_j(\mathrm{pa}_j, Z_j),$$
where $Z_j$ is the latent error representing unexplained variation in node $j$. Consider a Gaussian structural equation model in which each $f_j(\cdot,\cdot)$ is linear in $(\mathrm{pa}_j, Z_j)$ and $Z_j$ follows an independent $N(0, \sigma_j^2)$ distribution:
$$X_j = \sum_{k:\, k \neq j} A_{jk} X_k + Z_j, \qquad Z_j \sim N(0, \sigma_j^2),$$
where $A$ is the parametrized adjacency matrix in which a nonzero $(j,k)$ element $A_{jk}$ corresponds to a directed edge from parent node $k$ to child node $j$, with its value indicating the strength of the relation. Our goal is to estimate $A$ subject to the requirement that $A$ defines a directed acyclic graph.
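To make the model concrete, here is a minimal simulation sketch (my own, not from the paper): it generates data from the linear Gaussian structural equation model above, using a strictly lower-triangular $A$ so that acyclicity holds by construction. The dimensions and the edge probability 0.3 are illustrative.

```python
# Minimal sketch: simulate n samples from X_j = sum_{k != j} A_jk X_k + Z_j.
# A strictly lower-triangular A is automatically acyclic (parents k < child j).
import numpy as np

rng = np.random.default_rng(0)
p, n, sigma = 5, 200, 1.0   # illustrative sizes, not the paper's settings

# Random sparse lower-triangular adjacency matrix (edge strength * edge indicator).
A = np.tril(rng.normal(size=(p, p)), k=-1) * rng.binomial(1, 0.3, size=(p, p))

# Stacking samples as rows of X: x = A x + z per sample, so X = Z (I - A)^{-T}.
Z = rng.normal(scale=sigma, size=(n, p))
X = Z @ np.linalg.inv(np.eye(p) - A).T
```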
DAG models have been under-studied compared with undirected graphs. The major challenge stems from acyclicity and the corresponding nonconvex parameter space, defined by nonconvex constraints enforcing acyclicity of a graph. In contrast, the parameter space for Gaussian undirected graphical models is a positive-semidefinite cone, which is convex.
Denote a directed cycle by $(j_1, \dots, j_L)$, a sequence of node indices, where $L$ is the length of the cycle and $j_1$ is required to be the smallest index. A directed acyclic graph contains no directed cycle; equivalently, among the edges of every possible directed cycle, the number actually present must be smaller than the cycle's length. For example, to prevent a directed cycle, say $(1, 2, 3)$, we introduce the constraint
$$I(A_{21} \neq 0) + I(A_{32} \neq 0) + I(A_{13} \neq 0) \le 2.$$
On this basis, we introduce constraints on the entries of $A$ to enforce acyclicity for cycles of any length:
$$\sum_{k=1}^{L} I(A_{j_{k+1}\, j_k} \neq 0) \le L - 1, \qquad j_{L+1} = j_1,$$
for all $(j_1, \dots, j_L)$ with $j_1 = \min_{1 \le k \le L} j_k$, $j_1 \neq \cdots \neq j_L$, and $L = 2, \dots, p$. The total number of such constraints is
$$\binom{p}{2} + \binom{p}{3}\,2! + \cdots + \binom{p}{p}\,(p-1)! = O(p^p).$$
Theorem 1 of the paper implies that these $O(p^p)$ constraints can be reduced to $p^3 - p^2$ constraints.
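As a small illustration of what these constraints enforce (my own sketch, not the paper's reduction), acyclicity of a candidate $A$ can be verified directly by a Kahn-style elimination, without enumerating the $O(p^p)$ cycle constraints one by one:

```python
# Sketch: check whether the graph encoded by A is acyclic.
# Convention as above: A_jk != 0 means a directed edge from parent k to child j.
import numpy as np

def is_dag(A, tol=0.0):
    """Kahn-style check: repeatedly delete nodes with no remaining parents."""
    S = np.abs(A) > tol                      # support of the adjacency matrix
    alive = np.ones(A.shape[0], dtype=bool)
    while alive.any():
        # a node is a "root" if its row has no nonzero entry among alive columns
        roots = alive & ~(S[:, alive].any(axis=1))
        if not roots.any():
            return False                     # every remaining node has a parent: cycle
        alive &= ~roots
    return True
```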
Constrained Maximum Likelihood
Suppose that the $\sigma_j$ $(j = 1, \dots, p)$ are all equal. Then
$$X_j = \sum_{k:\, k \neq j} A_{jk} X_k + Z_j,$$
where the $Z_j$ are independent and identically distributed as $N(0, \sigma^2)$ with $\sigma > 0$.
Given an $n \times p$ data matrix $(x_{ij})_{n \times p}$, the negative log-likelihood, after dropping a constant term, is
$$l(A) = \frac{1}{2} \sum_{j=1}^{p} \sum_{i=1}^{n} \Bigl( x_{ij} - \sum_{k:\, k \neq j} x_{ik} A_{jk} \Bigr)^2,$$
which is convex in $A$.
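The loss is easy to transcribe; below is a direct NumPy version (my sketch), assuming $\operatorname{diag}(A) = 0$ so that the sum over $k \neq j$ becomes a plain matrix product:

```python
import numpy as np

def neg_loglik(A, X):
    """l(A) = 0.5 * sum_j sum_i (x_ij - sum_{k != j} x_ik A_jk)^2, with diag(A) = 0."""
    residual = X - X @ A.T      # entry (i, j) equals x_ij - sum_k x_ik A_jk
    return 0.5 * np.sum(residual ** 2)
```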
In addition to the acyclicity constraints, a constraint is also imposed on $A$ to regularize its sparsity. The directed acyclic graph learning problem can then be formulated as
$$\min_{A}\; l(A) \quad \text{subject to} \quad \sum_{i,j:\, i \neq j} I(A_{ij} \neq 0) \le K,$$
$$\sum_{k=1}^{L} I(A_{j_{k+1}\, j_k} \neq 0) \le L - 1, \qquad j_{L+1} = j_1,$$
for all $(j_1, \dots, j_L)$ with $j_1 = \min_{1 \le k \le L} j_k$, $j_1 \neq \cdots \neq j_L$, and $L = 2, \dots, p$. Applying Theorem 1 and approximating each indicator $I(z \neq 0)$ by a piecewise-linear function $J_\tau(z)$ to circumvent the discontinuity, the problem can be reduced further and solved by difference convex programming.
$$\min_{(A, \lambda)}\; l(A) \quad \text{subject to} \quad \sum_{i,j:\, i \neq j} J_\tau(A_{ij}) \le K, \qquad \lambda_{jk} + I(j \neq k) - \lambda_{ik} \ge J_\tau(A_{ij}),$$
where the second family of constraints, with auxiliary variables $\lambda$, replaces the cycle constraints. In practice, $\tau$ and $K$ are treated as tuning parameters, optimized using an independent validation set or by cross-validation over a grid of values on their domains.
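For concreteness, here is a sketch of the surrogate and its difference-of-convex split. The truncated-L1 form $J_\tau(z) = \min(|z|/\tau, 1)$ is my reading of the paper's piecewise-linear construction; the identity $\min(a, 1) = a - \max(a - 1, 0)$ exhibits $J_\tau$ as a difference of two convex functions, which is exactly the structure difference convex programming exploits.

```python
import numpy as np

def J_tau(z, tau):
    """Piecewise-linear surrogate for I(z != 0): assumed form min(|z|/tau, 1)."""
    return np.minimum(np.abs(z) / tau, 1.0)

def J_tau_dc_parts(z, tau):
    """J_tau = g1 - g2 with both parts convex; g2 is linearized at each DC step."""
    g1 = np.abs(z) / tau
    g2 = np.maximum(np.abs(z) / tau - 1.0, 0.0)
    return g1, g2
```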
They solve this constrained minimization through difference convex programming and the alternating direction method of multipliers.
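Finally, to convey the shape of that computation, here is a heavily simplified difference convex outer loop (my own sketch under strong simplifying assumptions, not the paper's algorithm): it swaps the constrained formulation for a penalized one, $\min_A l(A) + \rho \sum_{i \neq j} J_\tau(A_{ij})$, drops the acyclicity constraints entirely, and solves each DC subproblem, a weighted lasso, by plain proximal gradient instead of the alternating direction method of multipliers.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def dc_sparse_fit(X, tau=0.1, rho=1.0, n_dc=10, n_inner=200):
    """Illustrative DC loop for min_A l(A) + rho * sum J_tau(A_ij), no acyclicity."""
    n, p = X.shape
    A = np.zeros((p, p))
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L for the quadratic loss l(A)
    mask = ~np.eye(p, dtype=bool)            # keep diag(A) = 0 (no self-loops)
    for _ in range(n_dc):
        # DC linearization: only currently small entries keep a lasso penalty
        w = (np.abs(A) <= tau) / tau
        for _ in range(n_inner):             # ISTA on the weighted-lasso subproblem
            grad = (X @ A.T - X).T @ X       # gradient of l(A)
            A = soft_threshold(A - step * grad, step * rho * w)
            A *= mask
    return A
```

Each DC step here keeps a lasso penalty only on entries with $|A_{jk}| \le \tau$, as in the truncated-L1 literature; the penalty parameter `rho`, the step counts, and the omission of the cycle constraints are all illustrative rather than taken from the paper.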