Lecture 2: Bayesian Networks
Overview of Bayesian networks, their properties, and how they can be used to model the joint probability distribution over a set of random variables. Concludes with a summary of relevant sections from the textbook reading.
Motivation: explicitly representing a joint distribution over many random variables is computationally expensive (the number of parameters grows exponentially with the number of variables), so we need methodologies to represent joint distributions compactly.
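For example, a full table over n binary random variables requires 2^n - 1 independent parameters; for n = 100 that is already roughly 1.3 \times 10^{30} numbers.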
Two types of Graphical Models
Directed Graphs (Bayesian Networks)
A directed acyclic graph (DAG) contains nodes connected by directed edges, with no directed cycles.
The joint probability distribution represented by a directed graph factorizes into a product of each node's conditional distribution given its parents (made precise in the Factorization Theorem below).
Undirected Graphs (Markov Random Fields)
An undirected graph contains nodes that are connected via non-directional edges.
The joint probability distribution represented by an undirected graph factorizes into a product of potential functions defined over the graph's cliques, normalized by a partition function Z:
P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{X}_c)
Notation
- Variable: a capitalized English letter, with subscripts to represent dimensions (i, j, k) and superscripts to represent index, e.g. V_{i,j}^j.
- Values of variables: a lowercase letter denotes an 'observed value' of some random variable, e.g. v_{i,j}^j.
- Random variable: a variable with stochasticity, changing across different observations.
- Random vector: a capitalized, bold letter with random variables as entries (of dimension 1 \times n).
- Random matrix: a capitalized, bold letter with random variables as entries (of dimension n \times m).
- Parameters: Greek characters. They can themselves be considered random variables.
The Dishonest Casino
Let x denote the outcome of a single die roll. The casino switches between a fair die, which is uniform over the six faces (p(x = i) = 1/6), and a loaded die with the following distribution:
p(x=1) | p(x=2) | p(x=3) | p(x=4) | p(x=5) | p(x=6) |
---|---|---|---|---|---|
0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.5 |
Some questions we might want to ask are:
- Evaluation: How likely is the sequence, given our model of casino?
- Decoding: What portion of the sequence was generated with the fair die, and what portion with the loaded die?
- Learning: How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair die to loaded die, and back?
One way we could model this casino problem is as a Hidden Markov Model, where the hidden state y_t indicates which die (fair or loaded) was used at time t and the observation x_t is the face that was rolled.
The hidden variables all share the Markov property that the past is conditionally independent of the future given the present:
p(y_{t+1} \mid y_1, \ldots, y_t) = p(y_{t+1} \mid y_t)
This property is also explicitly highlighted in the topology of the graph.
Furthermore, we can compute the likelihood of a parse, i.e. the joint probability of a hidden state sequence \mathbf{y} and an observation sequence \mathbf{x} under the HMM, as follows:
p(\mathbf{x}, \mathbf{y}) = p(y_1)\, p(x_1 \mid y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)
The marginal and posterior distributions can be computed as follows:
- Marginal: p(\mathbf{x}) = \sum\limits_{y_{1}}\cdots\sum\limits_{y_T} p(\mathbf{x,y})
- Posterior: p(\mathbf{y} \mid \mathbf{x}) = \frac{p(\mathbf{x,y})}{p(\mathbf{x})}
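To make the evaluation question above concrete, below is a minimal sketch of the forward algorithm, which computes the marginal likelihood p(\mathbf{x}) by summing out the hidden states. The transition matrix and initial distribution are illustrative assumptions (they are not specified in the lecture); only the loaded-die emission row comes from the table above.

```python
import numpy as np

# States: 0 = fair die, 1 = loaded die.
# Emission probabilities p(x_t | y_t) for faces 1..6.
emission = np.array([
    [1/6] * 6,                          # fair die: uniform
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.5],     # loaded die (from the table above)
])

# Hypothetical transition matrix p(y_t | y_{t-1}) and initial distribution p(y_1):
# the casino occasionally switches dice (these values are assumptions for illustration).
transition = np.array([
    [0.95, 0.05],
    [0.10, 0.90],
])
initial = np.array([0.5, 0.5])

def sequence_likelihood(rolls):
    """Forward algorithm: returns p(x) = sum_y p(x, y) for a sequence of faces (1..6)."""
    alpha = initial * emission[:, rolls[0] - 1]            # alpha_1(y) = p(y_1) p(x_1 | y_1)
    for x in rolls[1:]:
        alpha = (alpha @ transition) * emission[:, x - 1]  # recursion over hidden states
    return alpha.sum()                                     # marginalize the final state

print(sequence_likelihood([1, 6, 6, 2, 6, 6, 6, 3]))
```

Decoding (which die generated each roll) would instead use the Viterbi algorithm, and the posterior over hidden states the forward-backward algorithm.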
Bayesian Network
- A BN is a directed graph whose nodes represent the random variables and whose edges represent directed influence of one variable on another.
- It is a data structure that provides the skeleton for representing a joint distribution compactly in a factorized way.
- It offers a compact representation for a set of conditional independence assumptions about a distribution.
- We can view the graph as encoding a generative sampling process executed by nature, where the value of each variable is selected by nature using a distribution that depends only on the variable's parents. In other words, each variable is a stochastic function of its parents.
Bayesian Network: Factorization Theorem
We define \mathrm{Pa}(X_i) to be the set of parents of node X_i in the graph \mathcal{G}.
As a result, the joint probability of a directed graph can be written as a product of local conditional distributions, one per node:
P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))
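As a small illustration, for a hypothetical three-node v-structure A \rightarrow C \leftarrow B (so A and B are the parents of C), the theorem gives
P(A, B, C) = P(A)\, P(B)\, P(C \mid A, B)
so three small local factors replace one table over all joint assignments.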
Specification of a Directed Graphical Model
There are two components to any GM:
- qualitative (topology)
- quantitative (the numbers associated with each conditional distribution)
Sources of Qualitative Specifications
Where do our assumptions come from?
- Prior knowledge of causal relationship
- Prior knowledge of modular relationship
- Assessment from expert
- Learning from data
- We simply like a certain architecture (e.g. a layered graph)
- …
Local Structures & Independencies
- Common parent (also called 'common cause' in section 3.3.1)
  - Fixing B decouples A and C: "Given the level of gene B, the levels of A and C are independent."
  - A \perp C \mid B \Rightarrow P(A,C \mid B) = P(A \mid B) P(C \mid B) (see the proof in the Footnotes below)
- Cascade (also called 'causal/evidential trail' in section 3.3.1)
  - Knowing B decouples A and C: "Given the level of gene B, the level of gene A provides no extra predictive value for the level of gene C."
- V-structure (also called 'common effect' in section 3.3.1)
  - Knowing C couples A and B, because A can "explain away" B with respect to C: "If A correlates with C, then the chance that B also correlates with C will decrease." (A small numeric sketch of this effect appears after the in-class example below.)
In-class example of the v-structure
My clock running late (event
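To make "explaining away" concrete, here is a small numeric sketch of a v-structure A \rightarrow C \leftarrow B with two independent binary causes; the prior and conditional probabilities below are illustrative assumptions. Observing C couples A and B: once B is also known to be true, the posterior probability of A drops.

```python
from itertools import product

# Hypothetical v-structure A -> C <- B with binary variables (1 = true, 0 = false).
p_a = {1: 0.1, 0: 0.9}                  # prior on cause A (assumed)
p_b = {1: 0.1, 0: 0.9}                  # prior on cause B (assumed)
p_c_given_ab = {                        # effect C is likely if either cause is present (assumed)
    (1, 1): 0.99, (1, 0): 0.9, (0, 1): 0.9, (0, 0): 0.01,
}

def joint(a, b, c):
    pc = p_c_given_ab[(a, b)]
    return p_a[a] * p_b[b] * (pc if c == 1 else 1 - pc)

def conditional(query_a, evidence):
    """P(A = query_a | evidence), where evidence is a dict over a subset of {'b', 'c'}."""
    num = den = 0.0
    for a, b, c in product([0, 1], repeat=3):
        if any(evidence.get(k) is not None and v != evidence[k]
               for k, v in (('b', b), ('c', c))):
            continue                    # skip assignments inconsistent with the evidence
        p = joint(a, b, c)
        den += p
        if a == query_a:
            num += p
    return num / den

print("P(A=1 | C=1)      =", round(conditional(1, {'c': 1}), 3))
print("P(A=1 | C=1, B=1) =", round(conditional(1, {'c': 1, 'b': 1}), 3))  # smaller: B explains C away
```

Under these assumed numbers this prints roughly 0.505 for P(A=1 \mid C=1) but only about 0.109 once B=1 is also observed.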
I-maps
- Definition (also see Definitions 3.2-3.3): Let P be a distribution over X. We define I(P) to be the set of independence assertions of the form (X \perp Y \vert Z) that hold in P.
- Definition: Let \mathcal{K} be any graph object associated with a set of independencies I(\mathcal{K}). We say that \mathcal{K} is an I-map for a set of independencies I if I(\mathcal{K}) \subseteq I.
- We now say that \mathcal{G} is an I-map for P if \mathcal{G} is an I-map for I(P), where I(\mathcal{G}) is the set of independencies associated with \mathcal{G}.
Facts about I-map
- For \mathcal{G} to be an I-map of P, it is necessary that \mathcal{G} does not mislead us regarding the independencies in P: any independence that \mathcal{G} asserts must also hold in P. Conversely, P may have additional independencies that are not reflected in \mathcal{G}. This is formally defined in Definition 3.4.
In class examples
Below we have two tables, each specifying a distribution over X and Y. Find the I-maps:
P_1: I(P_1) = \{X \perp Y\} (by inspection, P(X,Y) = P(X)P(Y); e.g. 0.48 = 0.6 \times 0.8)
Solution: Graph 1.
P_2: I(P_2) = \emptyset
Solution: Both graph 2 and graph 3, since they assume no independencies; therefore they are equivalent in terms of their independence sets. Graph 1 asserts an independence, so it is not an I-map of P_2.
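As a sanity check, the independence in P_1 can be verified numerically. The joint table below is a hypothetical one chosen to be consistent with the values quoted above (P(X=1) = 0.6, P(Y=1) = 0.8, P(X=1, Y=1) = 0.48):

```python
# Hypothetical joint table for P_1, consistent with the marginals quoted above.
p1 = {(0, 0): 0.08, (0, 1): 0.32, (1, 0): 0.12, (1, 1): 0.48}

p_x = {x: sum(p for (xx, _), p in p1.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p1.items() if yy == y) for y in (0, 1)}

# X ⊥ Y holds iff every joint entry equals the product of its marginals.
independent = all(abs(p1[(x, y)] - p_x[x] * p_y[y]) < 1e-12 for x in (0, 1) for y in (0, 1))
print(independent)  # True, so graph 1 (which asserts X ⊥ Y) is an I-map of P_1
```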
What is I(\mathcal{G})?
Local Markov assumptions of BN
A Bayesian network structure \mathcal{G} is a directed acyclic graph whose nodes represent random variables X_1, \ldots, X_n, and it encodes a set of local conditional independence assumptions.
Local Markov assumptions
Definition
Let \mathrm{Pa}_{X_i} denote the parents of X_i in \mathcal{G}, and \mathrm{NonDescendants}_{X_i} denote the variables in the graph that are not descendants of X_i. Then \mathcal{G} encodes the following set of local conditional independence assumptions:
I_{\ell}(\mathcal{G}) = \{ X_i \perp \mathrm{NonDescendants}_{X_i} \mid \mathrm{Pa}_{X_i} : \forall i \}
In other words, each node X_i is conditionally independent of its non-descendants given its parents.
D-separation criterion for Bayesian networks
Definition 1
Variables X and Y are D-separated (conditionally independent) given a set Z if they are separated in the moralized ancestral graph constructed from X, Y, and Z, as described below.
- Example: if X \perp Y \vert Z, then we say Z D-separates X and Y.
- Ancestral graph (focusing only on the nodes of interest): keep only the nodes in question plus their ancestors.
- Moral ancestral graph: drop the direction of every edge, and connect ("marry") any two parents that share a child.
If there is a path from one node to the other in the moral ancestral graph that does not pass through any of the given (conditioning) nodes, then the two nodes are not conditionally independent.
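Below is a minimal sketch of this test (a hypothetical implementation, with the DAG given as a dict mapping each node to its list of parents): restrict to the query and evidence nodes plus their ancestors, moralize, drop edge directions, remove the evidence nodes, and check reachability.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """All ancestors of `nodes` (inclusive) in a DAG given as {node: [parents]}."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, []))
    return seen

def d_separated(parents, xs, ys, zs):
    """True iff every node in xs is d-separated from every node in ys given zs."""
    keep = ancestors(parents, set(xs) | set(ys) | set(zs))   # ancestral graph
    undirected = {n: set() for n in keep}
    for child in keep:
        pas = [p for p in parents.get(child, []) if p in keep]
        for p in pas:                                        # drop edge directions
            undirected[child].add(p)
            undirected[p].add(child)
        for p, q in combinations(pas, 2):                    # moralize: marry co-parents
            undirected[p].add(q)
            undirected[q].add(p)
    # Remove the evidence nodes, then test reachability from xs to ys.
    blocked = set(zs)
    frontier, reached = [x for x in xs if x not in blocked], set()
    while frontier:
        n = frontier.pop()
        if n in reached or n in blocked:
            continue
        reached.add(n)
        frontier.extend(undirected.get(n, set()) - reached - blocked)
    return not (reached & set(ys))

# Example graph used later in this lecture: x1 -> x3, x1 -> x4, x2 -> x3.
g = {'x1': [], 'x2': [], 'x3': ['x1', 'x2'], 'x4': ['x1']}
print(d_separated(g, {'x2'}, {'x4'}, set()))    # True:  x2 ⊥ x4
print(d_separated(g, {'x2'}, {'x4'}, {'x3'}))   # False: conditioning on x3 activates the v-structure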
Practical definition of I(\mathcal{G})
Global Markov properties of Bayesian networks
X is d-separated (directed-separated) from Z given Y if we can't send a ball from any node in X to any node in Z using the "Bayes-ball" algorithm (plus some boundary conditions). Note this is expressed formally in Definition 3.7.
- Definition: I(\mathcal{G}) is the set of all independence properties that correspond to d-separation:
I(\mathcal{G}) = \{ X \perp Z \mid Y : \text{d-sep}_{\mathcal{G}}(X; Z \mid Y) \}
I(G) example
In this graph there are two types of active trail structures (see section 3.3.1 from the reading below for definition):
- Common cause: x_3 \leftarrow x_1 \rightarrow x_4. This trail is active if we don't condition on x_1.
- Common effect: x_2 \rightarrow x_3 \leftarrow x_1. This trail is active if we condition on x_3.
To find the independencies, consider all trails with length greater than 1 (since a node cannot be independent of its parent).
Trails of length 2:
- x_2 \rightleftharpoons x_3 \rightleftharpoons x_1: due to the 'common effect' structure, this trail is blocked as long as we do not condition on x_3. Therefore we get (x_2 \perp x_1), (x_2 \perp x_1 \vert x_4).
- x_3 \rightleftharpoons x_1 \rightleftharpoons x_4: due to the 'common cause' structure, this trail is blocked if we condition on x_1. Therefore we get (x_3 \perp x_4 \vert x_1), (x_3 \perp x_4 \vert \{x_1, x_2\}).
Trails of length 3 (only x_2 \rightleftharpoons x_3 \rightleftharpoons x_1 \rightleftharpoons x_4):
- Due to the 'common effect' structure x_2 \rightarrow x_3 \leftarrow x_1, this trail is blocked as long as we do not condition on x_3. Therefore we get (x_2 \perp x_4), (x_2 \perp x_4 \vert x_1).
- Due to the 'common cause' structure x_3 \leftarrow x_1 \rightarrow x_4, this trail is blocked if we condition on x_1. Therefore we get (x_2 \perp x_4 \vert x_1), (x_2 \perp x_4 \vert \{x_1, x_3\}).
Trails between sets of nodes
x_2 \perp \{x_1, x_4\}: this is true by d-separation because we have seen that every trail between x_2 and x_1, and between x_2 and x_4, is blocked.
Full I(\mathcal{G})
Putting the above together, we have the following independencies:
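Collecting the independencies derived above for this four-node graph (edges x_1 \rightarrow x_3, x_1 \rightarrow x_4, x_2 \rightarrow x_3), we obtain
I(\mathcal{G}) = \{ (x_2 \perp x_1),\ (x_2 \perp x_1 \vert x_4),\ (x_3 \perp x_4 \vert x_1),\ (x_3 \perp x_4 \vert \{x_1, x_2\}),\ (x_2 \perp x_4),\ (x_2 \perp x_4 \vert x_1),\ (x_2 \perp x_4 \vert \{x_1, x_3\}),\ (x_2 \perp \{x_1, x_4\}) \}
together with the symmetric statements.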
The Equivalence Theorem
For a graph \mathcal{G}, let \mathcal{D}_1 denote the family of all distributions that satisfy I(\mathcal{G}), and let \mathcal{D}_2 denote the family of all distributions that factor according to \mathcal{G}. Then \mathcal{D}_1 \equiv \mathcal{D}_2.
This means separation properties in the graph imply independence properties about the associated variables.
Conditional Probability Tables (CPTs)
When the variables are discrete, each node stores a conditional probability table: one discrete distribution over the node's values for every joint assignment of its parents.
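For instance, a hypothetical binary node C with binary parents A and B could store the following table of P(C = 1 \mid A, B) (the numbers are purely illustrative):

A | B | P(C=1 \mid A, B) |
---|---|---|
0 | 0 | 0.1 |
0 | 1 | 0.7 |
1 | 0 | 0.8 |
1 | 1 | 0.95 |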
Conditional Probability Densities (CPDs)
The conditional probabilities can also be given by a continuous density, e.g. a Gaussian distribution.
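One common parametric choice (an illustration, not the only option) is the linear-Gaussian CPD: for a continuous node X with continuous parents U_1, \ldots, U_k,
p(X \mid u_1, \ldots, u_k) = \mathcal{N}\left(\beta_0 + \sum_{i=1}^{k} \beta_i u_i,\ \sigma^2\right)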
Summary of BN
- Conditional independencies imply factorization.
- Factorization according to \mathcal{G} implies the associated conditional independencies.
Soundness and Completeness of D-separation
Soundness and completeness are two desirable properties of d-separation, formally defined in Section 3.3.2.
- Soundness: If a distribution P factorizes according to \mathcal{G}, then I(\mathcal{G}) \subseteq I(P).
- "Completeness" (claim): For any distribution P that factorizes over \mathcal{G}, if (X \perp Y \vert Z) \in I(P), then \text{d-sep}_{\mathcal{G}}(X; Y \vert Z).
Actually, this "completeness" claim does not always hold: "Even if a distribution factorizes over \mathcal{G}, it can still contain additional independencies that are not reflected in the structure."
Example (follows Example 3.3 from textbook):
Consider a distribution P that factorizes over the two-node graph A \to B. D-separation asserts no independencies here, since A and B are directly connected.
However, we can choose the conditional probability table so that independencies hold in P which do not follow from d-separation.
In such a table, B has the same conditional distribution for every value of A, so (A \perp B) holds in P even though it is not implied by the graph structure.
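For instance, assuming binary variables, a CPT of the following form has this property (the values are illustrative):

A | P(B=1 \mid A) | P(B=0 \mid A) |
---|---|---|
a^0 | 0.4 | 0.6 |
a^1 | 0.4 | 0.6 |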
Theorem
Let \mathcal{G} be a BN structure. If X and Y are not d-separated given Z in \mathcal{G}, then X and Y are dependent given Z in some distribution P that factorizes over \mathcal{G}.
Theorem:
For almost all distributions P that factorize over \mathcal{G} (i.e., for all distributions except a set of measure zero in the space of CPD parameterizations), we have I(P) = I(\mathcal{G}).
Readings
The Bayesian network representation
3.1.1 Exploiting Independence Properties
Standard vs compact parametrization of independent random variables
Given n independent binary random variables X_1, \ldots, X_n (e.g., the outcomes of n independent coin tosses), the standard explicit-table representation of the joint distribution requires 2^n - 1 independent parameters.
Note there are 2 possibilities for each outcome, so the joint distribution has 2^n entries.
One simple way to reduce the number of parameters needed would be to represent the probability that each coin toss lands heads by its own parameter \theta_i, so that only n parameters are needed in total.
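Under this compact parametrization, the joint distribution over the n independent tosses is (with \theta_i the probability that toss i lands heads):
P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} \theta_i^{x_i} (1 - \theta_i)^{1 - x_i}, \qquad x_i \in \{0, 1\}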
3.1.3 Naive Bayes
We can further express the joint distribution in terms of conditional probabilities. This is done in the Naive Bayes model.
Instances of this model will include:
- Class: some category C \in \{c^1, \ldots, c^k\}, which gives a prior on the value of each feature.
- Features: observed properties X_1, \ldots, X_k.
The model makes the strong 'naive' conditional independence assumption:
(X_i \perp X_j \mid C) \quad \text{for all } i \neq j
In words, features are conditionally independent given the class of the instance. Thus the joint distribution of the Naive Bayes model factorizes as follows:
P(C, X_1, \ldots, X_k) = P(C) \prod_{i=1}^{k} P(X_i \mid C)
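Below is a minimal sketch of how this factorization is used for classification; the class priors and per-feature likelihoods are illustrative assumptions, not values from the text.

```python
# Hypothetical Naive Bayes model with two classes and three binary features.
prior = {'spam': 0.3, 'ham': 0.7}
# P(feature_i = 1 | class): one independent parameter per (feature, class) pair.
likelihood = {
    'spam': [0.8, 0.6, 0.1],
    'ham':  [0.1, 0.3, 0.4],
}

def posterior(x):
    """P(C | x) for a binary feature vector x, via P(C) * prod_i P(x_i | C), renormalized."""
    unnorm = {}
    for c, p_c in prior.items():
        p = p_c
        for p_xi, xi in zip(likelihood[c], x):
            p *= p_xi if xi == 1 else 1 - p_xi
        unnorm[c] = p
    z = sum(unnorm.values())
    return {c: p / z for c, p in unnorm.items()}

print(posterior([1, 1, 0]))  # features consistent with 'spam' get most of the posterior mass
```

Note that only one parameter per (feature, class) pair is needed here, rather than a full joint table over all features.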
3.2.1 Bayesian networks
Bayesian networks use a graph whose nodes are the random variables in the domain, and whose edges represent conditional probability statements. Unlike in the Naive Bayes model, Bayesian networks can also represent distributions that do not satisfy the naive conditional independence assumption.
Definition 3.1: Bayesian Network (BN)
A Bayesian network is a directed acyclic graph \mathcal{G} whose nodes represent random variables X_1, \ldots, X_n, annotated with a conditional probability distribution (CPD) for each node given its parents in \mathcal{G}.
For each variable X_i, the CPD P(X_i \mid \mathrm{Pa}_{X_i}) specifies the distribution of X_i for every assignment of its parents; together, the graph and the CPDs define the joint distribution.
3.2.3 Graphs and Distributions
In this section it is shown that a distribution P satisfies the independencies associated with the graph \mathcal{G} if and only if P factorizes according to \mathcal{G}.
Definition 3.2-3.3: I-Map
Let P be a distribution over a set of variables X. We define I(P) to be the set of independence assertions of the form (X \perp Y \vert Z) that hold in P.
Let \mathcal{K} be any graph object associated with a set of independencies I(\mathcal{K}). We say that \mathcal{K} is an I-map for a set of independencies I if I(\mathcal{K}) \subseteq I.
Note this means that \mathcal{G} is an I-map of P only if every independence asserted by \mathcal{G} actually holds in P; P may have additional independencies that are not reflected in \mathcal{G}.
I-Map to factorization
In this section it is proven (see text) that the conditional independence assumptions implied by the BN structure \mathcal{G} allow the joint distribution to be written compactly as a product of local conditional distributions, one per node.
Definition 3.4 Factorization
Let \mathcal{G} be a BN graph over the variables X_1, \ldots, X_n. We say that a distribution P over the same space factorizes according to \mathcal{G} if P can be expressed as the product P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i}). This equation is called the chain rule for Bayesian networks, and the individual factors P(X_i \mid \mathrm{Pa}_{X_i}) are called conditional probability distributions (CPDs).
Reduction in number of parameters
In a distribution over n binary-valued random variables, the explicit joint table requires 2^n - 1 independent parameters, whereas a Bayesian network in which each node has at most k parents requires fewer than n \cdot 2^k independent parameters.
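As an illustrative comparison: with n = 32 binary variables and at most k = 4 parents per node, the full table requires 2^{32} - 1 \approx 4.3 \times 10^{9} parameters, while the Bayesian network requires at most 32 \cdot 2^{4} = 512.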
Factorization to I-map
The converse also holds, as given by the following theorem:
Theorem 3.2
Let \mathcal{G} be a BN structure over a set of random variables X, and let P be a joint distribution over the same space. If P factorizes according to \mathcal{G}, then \mathcal{G} is an I-map for P.
Box 3.C Knowledge engineering
Building a BN in the real world requires many steps including:
- picking variables precisely: we need some variables that can be observed, and their domain should be specified. Sometimes introducing hidden variables will help to render the observed variables conditionally independent.
- picking a structure consistent with the causal order: choose enough variables to approximate the causal relationships.
- picking probability estimates for events in the network: data collection may be difficult but small errors have little effect (though the network is sensitive to events assigned a zero probability, and to relative size of probabilities of events)
Done correctly, the model will be useful as well as not too complex to use (see Box 3.D).
3.3.1 D-separation
Objective: determine which independencies hold for every distribution P that factorizes over the graph \mathcal{G}.
Definition 2.16 Trail
We say that X_1 \rightleftharpoons \cdots \rightleftharpoons X_k form a trail in a graph if, for every i = 1, \ldots, k-1, the consecutive nodes X_i and X_{i+1} are connected by an edge, in either direction.
Types of active two edge trails
By examining the four possible 2-edge connections between nodes X, Z, and Y, we can determine when influence can flow from X to Y via Z (i.e., when the trail is active):
- Causal trail X \rightarrow Z \rightarrow Y: active iff Z is not observed.
- Evidential trail X \leftarrow Z \leftarrow Y: active iff Z is not observed.
- Common cause X \leftarrow Z \rightarrow Y: active iff Z is not observed.
- Common effect X \rightarrow Z \leftarrow Y: active iff Z or one of its descendants is observed.
For influence to flow through a longer trail, every constituent two-edge segment must allow influence to flow; this is captured by the notion of an active trail.
Definition 3.6 active trail
Let \mathcal{G} be a BN structure and X_1 \rightleftharpoons \cdots \rightleftharpoons X_n a trail in \mathcal{G}. Let Z be a subset of observed variables. The trail is active given Z if:
- Whenever we have a v-structure X_{i-1} \rightarrow X_i \leftarrow X_{i+1}, then X_i or one of its descendants is in Z;
- No other node along the trail is in Z.
Further, for directed graphs that may have more than one trail between nodes, "directed separation" (d-separation) gives a notion of separation between sets of nodes.
Definition 3.7 D-separation
Let X, Y, Z be three sets of nodes in \mathcal{G}. We say that X and Y are d-separated given Z, denoted \text{d-sep}_{\mathcal{G}}(X; Y \vert Z), if there is no active trail between any node X \in \mathbf{X} and any node Y \in \mathbf{Y} given Z.
3.3.2 Soundness and Completeness
As a method d-separation has the following properties (proof in text):
Soundness: If a distribution P factorizes according to \mathcal{G}, then I(\mathcal{G}) \subseteq I(P); i.e., d-separation never asserts an independence that fails to hold in P.
Completeness: If (X \perp Y \vert Z) holds in every distribution P that factorizes over \mathcal{G}, then \text{d-sep}_{\mathcal{G}}(X; Y \vert Z); i.e., d-separation detects all independencies that hold across all such distributions.
Footnotes
- Proof. We have the following:
\begin{aligned} P(A,B,C) &= P(B)P(A \vert B)P(C \vert B) \\ P(A,C \vert B) &= P(A,B,C)/P(B) \end{aligned}
Plugging the first equation into the second, P(A,C \vert B) = P(B)P(A \vert B)P(C \vert B)/P(B) = P(A \vert B)P(C \vert B). [↩]
References
- Blei, D.M., 2015. The Basics of Graphical Models. Columbia University.
- Jordan, M.I., 2003. An Introduction to Probabilistic Graphical Models. University of California, Berkeley.
- Koller, D. and Friedman, N., 2009. Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. The MIT Press.