Lecture 9: Modeling Networks
Classic network learning algorithms.
Network research studies the graph itself as an object of interest.
How do we estimate a graph from real-world data? Different generation rules produce different graphs, and we explore methods for this estimation problem here.
Structural Learning
Trees: The Chow-Liu algorithm
The Chow-Liu algorithm directly searches for the optimal tree structure: it estimates the mutual information between every pair of variables and returns a maximum-weight spanning tree with respect to those pairwise weights.
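As a concrete illustration (a minimal sketch, not the lecture's code), Chow-Liu on binary data amounts to estimating pairwise mutual information from samples and taking a maximum-weight spanning tree over those weights; the data below are made up for the example.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def pairwise_mutual_information(X):
    """Empirical mutual information between every pair of binary columns of X."""
    n, p = X.shape
    mi = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            joint = np.zeros((2, 2))
            for a in (0, 1):
                for b in (0, 1):
                    joint[a, b] = np.mean((X[:, i] == a) & (X[:, j] == b))
            pi, pj = joint.sum(axis=1), joint.sum(axis=0)
            for a in (0, 1):
                for b in (0, 1):
                    if joint[a, b] > 0:
                        mi[i, j] += joint[a, b] * np.log(joint[a, b] / (pi[a] * pj[b]))
            mi[j, i] = mi[i, j]
    return mi

def chow_liu_tree(X):
    """Edge list of the maximum mutual-information spanning tree."""
    mi = pairwise_mutual_information(X)
    # A minimum spanning tree of the negated weights is a maximum-weight spanning tree.
    mst = minimum_spanning_tree(-mi).toarray()
    return [(i, j) for i in range(mi.shape[0]) for j in range(mi.shape[0]) if mst[i, j] != 0]

# Hypothetical binary data: a chain dependency x0 -> x1 -> x2, plus an independent x3.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
x1 = (x0 ^ (rng.random(500) < 0.1)).astype(int)
x2 = (x1 ^ (rng.random(500) < 0.1)).astype(int)
x3 = rng.integers(0, 2, 500)
X = np.column_stack([x0, x1, x2, x3])
print(chow_liu_tree(X))
```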
Pairwise Markov Random Fields
The key idea for graph structure learning is to view network inference as parameter estimation. Each node in the graph can take either a binary or a continuous value, and all the variables are connected pairwise, which gives a pairwise Markov random field (also called a Boltzmann machine in the binary case). The joint probability of all the nodes is represented as

$$p(x) \propto \exp\Big(\sum_i \theta_i x_i + \sum_{i<j} \theta_{ij}\, x_i x_j\Big).$$
We use the non-zero parameters $\theta_{ij}$ to represent the existence of edges. This equivalence is important because it turns a search over discrete topologies into estimation in a continuous parameter space.
The structure of many classic models can be learned this way. For discrete node states we have the Ising/Potts model; for continuous nodes we have the Gaussian graphical model; and we can also have a heterogeneous model with both discrete and continuous nodes. The key takeaway is that the parameter matrix encodes the graph structure (non-zero entries correspond to edges), as the snippet below illustrates.
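For example, assuming we already have an estimated parameter matrix `Theta` (made up here), reading off the graph is just a matter of looking at the non-zero off-diagonal entries:

```python
import numpy as np

# Hypothetical estimated pairwise parameter matrix (symmetric).
Theta = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.2, -0.5],
                  [0.0, -0.5, 0.9]])

# Edges are exactly the non-zero off-diagonal parameters.
adjacency = (np.abs(Theta) > 1e-8) & ~np.eye(Theta.shape[0], dtype=bool)
print(adjacency.astype(int))
```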
Multivariate Gaussian
Recall the definition of the multivariate Gaussian:

$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\Big).$$

Let $Q = \Sigma^{-1}$ denote the precision matrix. Expanding the quadratic form in the exponent and dropping constants gives

$$p(x) \propto \exp\Big(\sum_i \big[(Q\mu)_i\, x_i - \tfrac{1}{2} q_{ii}\, x_i^2\big] - \sum_{i<j} q_{ij}\, x_i x_j\Big).$$
This equation can be viewed as a continuous Markov Random Field with potentials defined on every node and edge.
Gaussian Graphical Model
Consider the model on a zero-mean Gaussian distribution,

$$p(x \mid \mu = 0, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}\, x^T \Sigma^{-1} x\Big),$$

where $Q = \Sigma^{-1}$ is the precision matrix; its non-zero entries define the edges of the graphical model.
Markov vs Correlation Network
While these two models look similar, the Markov network is usually the better model of real-world dependencies because it captures conditional probabilities: whether two nodes interact given all the others, rather than mere marginal correlation.
Correlation network
A correlation network is based on the covariance matrix $\Sigma$: an edge is drawn between nodes $i$ and $j$ whenever $\Sigma_{ij} \neq 0$, i.e. whenever the two variables are marginally correlated.
Gaussian Graphical Model
A Gaussian Graphical Model (GGM) is a Markov network based on the precision matrix $Q = \Sigma^{-1}$: an edge connects nodes $i$ and $j$ whenever $q_{ij} \neq 0$, i.e. whenever $x_i$ and $x_j$ are conditionally dependent given all the other nodes.
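The contrast can be seen on a tiny synthetic example (a sketch, not from the lecture): for a chain-structured Gaussian, the precision matrix is sparse and matches the chain, while the covariance matrix it implies is dense, because marginal correlations propagate along paths.

```python
import numpy as np

# Precision matrix of a 4-node chain 1 - 2 - 3 - 4 (tridiagonal, hence sparse).
Q = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  2.]])
Sigma = np.linalg.inv(Q)

print(np.round(Sigma, 2))   # dense: even non-adjacent nodes are marginally correlated
print(np.abs(Q) > 1e-8)     # sparse: non-zeros match the chain's edges exactly
```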
However, there are still noticeable problems with learning the dependencies of a Gaussian graphical model this way. First, the approach requires the empirical covariance matrix to be invertible, yet a rank-deficient covariance matrix does not mean that the Markov network does not exist; in particular, when the sample size is small relative to the number of nodes, the empirical covariance matrix cannot be inverted. Second, the computational complexity of matrix inversion is cubic in the number of nodes, which does not scale to large graphs.
Since our goal is to obtain the precision matrix, we need an additional assumption to get around these problems.
Prior Assumption of GGM - Sparsity
We make the common assumption that the precision matrix $Q$ is sparse, i.e. most of its off-diagonal entries are zero, so that each node has only a few neighbors.
Even after making this assumption, estimating the whole precision matrix at once is still hard. What if we recover the precision matrix row by row (or column by column)? This is exactly the idea behind the method described next.
GGM with Lasso
Our goal is to select the neighborhood of each node, so we regress every node on all the other nodes and read the relationship between pairs of nodes off the regression coefficients. The problem with ordinary regression is that, when we regress one node on the rest, essentially all coefficients come out non-zero, so every other node would be declared a neighbor. This is why we use LASSO regression, whose penalty drives many coefficients exactly to zero. The regression problem at each node $i$ can therefore be formulated as the following LASSO problem:

$$\hat{\theta}_i = \arg\min_{\theta_i} \big\| x_i - X_{\setminus i}\, \theta_i \big\|_2^2 + \lambda \|\theta_i\|_1,$$

where $x_i$ is the vector of observations of node $i$, $X_{\setminus i}$ collects the observations of all the other nodes, and $\lambda$ controls the sparsity of the solution.
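A minimal sketch of this node-wise LASSO procedure follows, using scikit-learn's `Lasso`; the simulated data, the fixed penalty `lam`, and the OR rule used to symmetrize the two regressions per edge are illustrative choices, not prescriptions from the lecture.

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, lam=0.1):
    """Regress each node on all others with an l1 penalty; non-zero coefficients are edges."""
    n, p = X.shape
    coef = np.zeros((p, p))
    for i in range(p):
        others = [j for j in range(p) if j != i]
        model = Lasso(alpha=lam, fit_intercept=False)
        model.fit(X[:, others], X[:, i])
        coef[i, others] = model.coef_
    # Symmetrize with an OR rule: keep an edge if either regression selects it.
    adjacency = (np.abs(coef) > 1e-8) | (np.abs(coef.T) > 1e-8)
    np.fill_diagonal(adjacency, False)
    return adjacency

# Hypothetical data drawn from a chain-structured Gaussian graphical model.
rng = np.random.default_rng(1)
Q = np.diag(np.full(5, 2.0)) + np.diag(np.full(4, -1.0), 1) + np.diag(np.full(4, -1.0), -1)
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Q), size=2000)
print(neighborhood_selection(X).astype(int))
```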
LASSO regression can asymptotically recover the correct subset of relevant covariates only when the following assumptions are met:
- Dependency Condition: Relevant covariates are not overly dependent
- Incoherence Condition: Irrelevant covariates, however numerous, cannot be too correlated with the relevant covariates
- Strong concentration bounds: Sample quantities converge to expected values quickly
Theoretically, there is a proof that this graphical LASSO procedure can recover the true structure of the graph: if the conditions above hold and the regularization parameter $\lambda_n$ is chosen to decay at an appropriate rate as the sample size grows, then with high probability the estimated neighborhood of every node coincides with its true neighborhood, and hence the estimated graph equals the true graph.
Why does this algorithm work?
Now that we have seen how the algorithm obtains the precision matrix by repeatedly applying LASSO regression at each node, we want to understand why this works; in other words, why does LASSO regression select the neighborhood of each node?
Multivariate Gaussian
For the multivariate Gaussian there are several formulas to remember. Partition the variables as $x = (x_1, x_2)$ with

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

Then the marginal and conditional distributions are

$$p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22}), \qquad p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2}),$$

where we have

$$m_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad V_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
The matrix inverse lemma
Consider a block-partitioned matrix:

$$M = \begin{pmatrix} E & F \\ G & H \end{pmatrix}.$$

We block-diagonalize $M$ by multiplying on the left and right with triangular matrices:

$$\begin{pmatrix} I & -FH^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} E & F \\ G & H \end{pmatrix} \begin{pmatrix} I & 0 \\ -H^{-1}G & I \end{pmatrix} = \begin{pmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{pmatrix}.$$

The block $M/H := E - FH^{-1}G$ is called the Schur complement of $H$ in $M$. Inverting the factorization gives the matrix inverse lemma in block form:

$$M^{-1} = \begin{pmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ -H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{pmatrix}.$$
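As a quick numerical sanity check (not part of the lecture), we can verify the block-inverse formula against a direct matrix inverse on a random positive-definite matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = A @ A.T + 5 * np.eye(5)          # a random symmetric positive-definite matrix
E, F, G, H = M[:2, :2], M[:2, 2:], M[2:, :2], M[2:, 2:]

S = E - F @ np.linalg.inv(H) @ G     # Schur complement of H in M
Si, Hi = np.linalg.inv(S), np.linalg.inv(H)

# Block form of the inverse built from the Schur complement.
Minv_blocks = np.block([[Si,           -Si @ F @ Hi],
                        [-Hi @ G @ Si,  Hi + Hi @ G @ Si @ F @ Hi]])

print(np.allclose(Minv_blocks, np.linalg.inv(M)))   # True
```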
The covariance and the precision matrices
If we have the covariance matrix partitioned so as to single out node $i$,

$$\Sigma = \begin{pmatrix} \sigma_{ii} & \Sigma_{i,\setminus i} \\ \Sigma_{\setminus i, i} & \Sigma_{\setminus i, \setminus i} \end{pmatrix},$$

then, recalling the facts about the block matrix inverse derived in the previous section, we can write the corresponding blocks of the precision matrix $Q = \Sigma^{-1}$ as

$$q_{ii} = \big(\sigma_{ii} - \Sigma_{i,\setminus i}\, \Sigma_{\setminus i,\setminus i}^{-1}\, \Sigma_{\setminus i, i}\big)^{-1}, \qquad Q_{i,\setminus i} = -\,q_{ii}\, \Sigma_{i,\setminus i}\, \Sigma_{\setminus i,\setminus i}^{-1}.$$
Justification
With the above three facts, one is ready to justify why the problem can be formulated as a LASSO variable selection problem. Given a (zero-mean) Gaussian distribution, the conditional distribution of a single node $i$ given the rest of the nodes can be written as

$$p(x_i \mid x_{\setminus i}) = \mathcal{N}\big(x_i \mid \Sigma_{i,\setminus i}\, \Sigma_{\setminus i,\setminus i}^{-1}\, x_{\setminus i},\; \sigma_{ii} - \Sigma_{i,\setminus i}\, \Sigma_{\setminus i,\setminus i}^{-1}\, \Sigma_{\setminus i,i}\big).$$

Let $\theta_i := \Sigma_{\setminus i,\setminus i}^{-1}\, \Sigma_{\setminus i,i}$; by the block expressions above, $\theta_i = -\,Q_{\setminus i, i} / q_{ii}$, so $\theta_i$ has the same zero pattern as the $i$-th row of the precision matrix. From here we can already see that the value of a node is a linear combination of the other nodes plus Gaussian noise:

$$x_i = \theta_i^T x_{\setminus i} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}\big(0,\; \sigma_{ii} - \Sigma_{i,\setminus i}\, \Sigma_{\setminus i,\setminus i}^{-1}\, \Sigma_{\setminus i,i}\big).$$
Neighborhood estimation based on auto-regression coefficients
If we are given an estimate $\hat{\theta}_i$ of the auto-regression coefficients at node $i$, then because $\theta_i$ is a scalar multiple of $Q_{\setminus i, i}$, the non-zero entries of $\hat{\theta}_i$ mark exactly the nodes whose precision entries with node $i$ are non-zero. Therefore, the neighborhood of node $i$ can be estimated as $\hat{S}_i = \{\, j \neq i : \hat{\theta}_{ij} \neq 0 \,\}$, and repeating the LASSO regression at every node recovers the full graph structure.
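The identification of the regression coefficients with a rescaled row of the precision matrix can also be checked numerically; the chain-structured precision matrix below is a made-up example.

```python
import numpy as np

# A hypothetical sparse precision matrix (a 4-node chain) and its covariance.
Q = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  2.]])
Sigma = np.linalg.inv(Q)

i = 1
rest = [j for j in range(4) if j != i]

# Population regression coefficients of x_i on x_{-i} ...
beta = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[np.ix_(rest, [i])]).ravel()
# ... equal -q_{i,-i} / q_{ii}, so their zero pattern matches the edges of node i.
print(np.allclose(beta, -Q[i, rest] / Q[i, i]))   # True
```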
Discrete Models
Given a vector of discrete node states, say $x \in \{-1, +1\}^p$ with joint distribution

$$p(x) \propto \exp\Big(\sum_i \theta_i x_i + \sum_{i<j} \theta_{ij}\, x_i x_j\Big),$$

we can still estimate a graph in the same spirit as in the continuous case shown above. However, we cannot do linear regression in the discrete case, so instead we employ ($\ell_1$-regularized) logistic regression. Namely, the conditional distribution of one node given the rest takes a logistic form,

$$p(x_i = 1 \mid x_{\setminus i}) = \frac{1}{1 + \exp\big(-2\,(\theta_i + \sum_{j \neq i} \theta_{ij}\, x_j)\big)},$$

so regressing each node on the others with an $\ell_1$ penalty again selects its neighborhood.
We can now apply similar techniques to those shown above, but with this new loss function. Implicitly, we assume, as we did before, that every observation is drawn independently from one and the same distribution, and therefore from one and the same underlying graph.
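A sketch of the node-wise approach for binary data, using scikit-learn's $\ell_1$-penalized `LogisticRegression`; the simulated data and the penalty strength `C` are placeholder choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def binary_neighborhood_selection(X, C=0.1):
    """l1-regularized logistic regression of each binary node on all the others."""
    n, p = X.shape
    coef = np.zeros((p, p))
    for i in range(p):
        others = [j for j in range(p) if j != i]
        model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
        model.fit(X[:, others], X[:, i])
        coef[i, others] = model.coef_.ravel()
    # OR rule: keep an edge if either of the two node-wise regressions selects it.
    return (np.abs(coef) > 1e-8) | (np.abs(coef.T) > 1e-8)

# Hypothetical binary data: x1 copies x0 with noise, x2 copies x1 with noise, x3 independent.
rng = np.random.default_rng(2)
x0 = rng.integers(0, 2, 1000)
x1 = (x0 ^ (rng.random(1000) < 0.2)).astype(int)
x2 = (x1 ^ (rng.random(1000) < 0.2)).astype(int)
x3 = rng.integers(0, 2, 1000)
X = np.column_stack([x0, x1, x2, x3])
print(binary_neighborhood_selection(X).astype(int))
```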
Evolving Networks
Networks are not always constant across time. Suppose you have data collected at varying time points; it may not be reasonable to assume that every data point is generated by the same network. A natural question to ask is then: at some given time $T$, what is our network? This violates our independent and identically distributed (i.i.d.) assumption.
We introduce the kernel-weighted $\ell_1$-regularized logistic regression estimator, such that

$$\hat{\theta}_i^{\,T} = \arg\min_{\theta_i} \; \sum_{t} w(t, T)\, \ell\big(\theta_i; x^{(t)}\big) + \lambda \|\theta_i\|_1,$$

where $\ell(\theta_i; x^{(t)})$ is the logistic loss of node $i$ on the observation at time $t$, and $w(t, T)$ is a kernel weight that decays as $t$ moves away from the target time $T$.
Here, we are assuming that we still have the full data set. We weight each data point based on its relationship to the current time: points that are closer to the target time $T$ are weighted more heavily than points sampled further away, since data taken closer to time $T$ are more reflective of the graph at time $T$. We can think of the weight as answering the question, "how much do I trust this data point to inform me about the graph at time $T$?" The weights are typically chosen using a kernel (e.g. a Gaussian kernel). Now we can use our $T$ data points to estimate $T$ graphs instead of a single graph. It can be shown that this method enjoys the same consistency as if we had used i.i.d. data, under a few reasonable assumptions such as the graph changing smoothly across time.
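One way to realize this idea in code (a sketch under the assumptions above, not the lecture's implementation) is to form Gaussian-kernel weights centered at the target time `T` and pass them as per-sample weights to the node-wise $\ell_1$-penalized logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def kernel_weights(sample_times, T, bandwidth=2.0):
    """Gaussian kernel weights: observations near the target time T count more."""
    w = np.exp(-0.5 * ((sample_times - T) / bandwidth) ** 2)
    return w * (len(w) / w.sum())   # rescale so the weights sum to the number of samples

def estimate_graph_at_time(X, sample_times, T, C=0.5):
    """Node-wise l1-penalized logistic regression, reweighted toward observations near time T."""
    n, p = X.shape
    w = kernel_weights(np.asarray(sample_times, dtype=float), T)
    coef = np.zeros((p, p))
    for i in range(p):
        others = [j for j in range(p) if j != i]
        model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
        model.fit(X[:, others], X[:, i], sample_weight=w)
        coef[i, others] = model.coef_.ravel()
    return (np.abs(coef) > 1e-8) | (np.abs(coef.T) > 1e-8)
```

Repeating `estimate_graph_at_time` for each target time then yields one graph per time point, as described above.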
While not directly covered in lecture, the temporally smoothed $\ell_1$-regularized logistic regression estimator is also worth knowing. It estimates the graphs at all time points jointly, such that

$$\hat{\theta}_i^{\,1:T} = \arg\min_{\theta_i^1, \ldots, \theta_i^T} \; \sum_{t=1}^{T} \ell\big(\theta_i^t; x^{(t)}\big) + \lambda_1 \sum_{t=1}^{T} \|\theta_i^t\|_1 + \lambda_2 \sum_{t=2}^{T} \|\theta_i^t - \theta_i^{t-1}\|_1.$$

This is equivalent to solving the following constrained optimization problem:

$$\min_{\theta_i^1, \ldots, \theta_i^T} \; \sum_{t=1}^{T} \ell\big(\theta_i^t; x^{(t)}\big) \quad \text{subject to} \quad \sum_{t=1}^{T} \|\theta_i^t\|_1 \le C_1, \;\; \sum_{t=2}^{T} \|\theta_i^t - \theta_i^{t-1}\|_1 \le C_2.$$
This algorithm serves as an extension of the temporally smoothed idea to the whole time series: all networks are estimated jointly, and the penalty on successive differences makes explicit the assumption that the graph changes smoothly across time.
References
- Lasso-type recovery of sparse representations for high-dimensional data
Meinshausen, N. and Yu, B., 2009. The Annals of Statistics, Vol 37(1), pp. 246--270. Institute of Mathematical Statistics.
- Recovering time-varying networks of dependencies in social and biological studies
Ahmed, A. and Xing, E.P., 2009. Proceedings of the National Academy of Sciences, Vol 106(29), pp. 11878--11883. National Academy of Sciences.