Bayesian Network Là Gì

By Kevin Murphy, 1998.

Bạn đang xem: Bayesian network là gì

"Graphical models are a marriage between probability theory andgraph theory. They provide a natural tool for dealing with two problemsthat occur throughout applied and engineering --uncertainty and complexity -- and in particular they are playing anincreasingly important role in the design and analysis of machinelearning algorithms. Fundamental to the idea of a graphical model isthe notion of modularity -- a complex system is built by combiningsimpler parts. Probability theory provides the glue whereby the partsare combined, ensuring that the system as a whole is consistent, andproviding ways to interface models to data. The graph theoretic sideof graphical models provides both an intuitively appealing interfaceby which humans can model highly-interacting sets of variables as wellas a data structure that lends itself naturally to the design ofefficient general-purpose algorithms.Many of the classical multivariate probabalistic systems studied infields such as, systems engineering, information theory,pattern recognition and statistical are special cases of thegeneral graphical model formalism -- examples include mixture models,factor analysis, hidden Markov models, Kalman filters and Isingmodels. The graphical model framework provides a way to view all ofthese systems as instances of a common underlying formalism. This viewhas many advantages -- in particular, specialized techniques that havebeen developed in one field can be transferred between researchcommunities and exploited more widely. Moreover, the graphical modelformalism provides a natural framework for the design of new systems."--- Michael Jordan, 1998.

This tutorial

We will briefly discuss the following Representation, or, what exactly is agraphical model? Inference, or, how can we use these modelsto efficiently answer probabilistic queries? Learning, or, what do we do if we don"t know what themodel is? Decision theory, or, what happens when itis time to convert beliefs into actions? Applications, or, what"s this all good for, anyway?Note: (a version of) this page is availablein pdf formathere.Also, Marie Stefanova has madea Swedish translationhere.

Articles in the popular press

Other sources of technical information

My tutorial on Bayes rule AUAI homepage (Association for Uncertainty in Artificial Intelligence) The UAImailing list UAI proceedings. My list of recommended reading . Bayes Net softwarepackages My Bayes Net Toolbox forMatlab Tutorial slides on graphical modelsand BNT, presented to the Mathworks, May 2003 Listof other Bayes net tutorials RepresentationProbabilistic graphical models are graphs in which nodes representrandom variables, and the (lack of) represent conditionalindependence assumptions.Hence they provide a compact representation of joint probabilitydistributions.Undirected graphical models, also called MarkovRandom Fields (MRFs) or Markov networks,have a simple definition of independence: two (sets of) nodes A and B are conditionally independent given a third set, C,if all paths between the nodes in A and B are separated by a node in C.By contrast, directed graphical modelsalso called Bayesian Networks or Belief Networks (BNs), have a morecomplicated notion of independence,which takes into account the directionality of, as we explain below.Undirected graphical models are more popular with the andvision communities, and directed models are more popular with the AIand communities. (It is possible to have a model with bothdirected and undirected, which is called a chain graph.)For a careful study of the relationship between directed andundirected graphical models, see the books by Pearl88, Whittaker90,and Lauritzen96.Although directed models have a more complicated notion ofindependence than undirected models,they do have several advantages.The most important is thatone can regard an arc from A to B asindicating that A ``causes"" B. (See the discussion on causality.)This can be used as a guide to construct the graph structure.In addition, directed models can encode deterministicrelationships, and are easier to learn (fit to data).In the rest of this tutorial, we will only discuss directed graphicalmodels, i.e., Bayesian networks.In addition to the graph structure, it is necessary to specify theparameters of the model.For a directed model, we must specifythe Conditional Probability Distribution (CPD) at each node.If the variables are discrete, this can be represented as a table(CPT), which lists the probability that the child node takes on eachof its different values for each combination of values of itsparents. Consider the following example, in which all nodes are binary,i.e., have two possible values, which we will denote by T (true) andF (false).We see that the event "grass is wet" (W=true) has two possible causes: either the water sprinker is on (S=true) or it israining (R=true).The strength of this relationship is shown in the table.For example, we see that Pr(W=true | S=true, R=false) = 0.9 (secondrow), andhence, Pr(W=false | S=true, R=false) = 1 - 0.9 = 0.1, since each rowmust sum to one.Since the C node has no parents, its CPT specifies the priorprobability that it is cloudy (in this case, 0.5).(Think of C as representing the season:if it is a cloudy season, it is less likely that the sprinkler is onand more likely that the rain is on.)The simplest conditional independence relationship encoded in a Bayesiannetwork can be stated as follows:a node is independent of its ancestors given its parents, where theancestor/parent relationship is with respect to some fixed topologicalordering of the nodes.By the chain rule of probability,the joint probability of all the nodes in the graph above isP(C, S, R, W) = P(C) * P(S|C) * P(R|C,S) * P(W|C,S,R)By using conditional independence relationships, we can rewrite this asP(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)where we were allowed to simplify the third term because R isindependent of S given its parent C, and the last term because W isindependent of C given its parents S and R.We can see that the conditional independence relationshipsallow us to represent the joint more compactly.Here the savings are minimal, but in general, if we had n binarynodes, the full joint would require O(2^n) space to represent, but thefactored form would require O(n 2^k) space to represent, where k isthe maximum fan-in of a node. And fewer parameters makes learning easier.

Are "Bayesian networks" Bayesian?

Despite the name,Bayesian networks do not necessarily imply a commitment to Indeed, it is common to use frequentists methods to estimate the parameters of the CPDs.Rather, they are so called because they useBayes" rule forprobabilistic inference, as we explain below.(The term "directed graphical model" is perhaps more appropriate.)Nevetherless, Bayes nets are a useful representation for hierarchicalBayesian models, which form the foundation of applied e.g., the BUGS project).In such a model, the parameters are treated like any other randomvariable, and becomes nodes in the graph.


The most common task we wish to solve using Bayesian networks isprobabilistic inference. For example, consider the water sprinklernetwork, and suppose we observe thefact that the grass is wet. There are two possible causes for this:either it is raining, or the sprinkler is on. Which is more likely?We can use Bayes" rule to compute the posterior probability of eachexplanation (where 0==false and 1==true).whereis a normalizing constant, equal to the probability (likelihood) ofthe data.So we see that it is more likely that the grass is wet becauseit is raining:the likelihood ratio is 0.7079/0.4298 = 1.647.

Explaining away

In the above example, notice that the two causes "compete" to "explain" theobserved data. Hence S and R become conditionally dependent given thattheir common child, W, is observed, even though they are marginallyindependent. For example,suppose the grass is wet, but that we also know that it is raining.Then the posterior probability that the sprinkler is on goes down:Pr(S=1|W=1,R=1) = 0.1945This is called "explaining away".In, this is known as Berkson"s paradox, or "selectionbias". For a dramatic example of this effect, consider a college whichadmits students who are either brainy or sporty (or both!).Let C denote the event that someone is admitted to college, which ismade true if they are either brainy (B) or sporty (S).Suppose in the general population, B and S are independent.We can model our conditional independence assumptions using a graphwhich is a V structure, with arrows pointing down: B S \ / v CNow look at a population of college students (those for which C isobserved to be true).It will be found that being brainy makes you less likely to be sportyand vice versa, because either property alone is sufficient to explainthe evidence on C(i.e., P(S=1 | C=1, B=1) try this little BNT demo!)

Top-down and bottom-up reasoning

In the water sprinkler example, we had evidence of an effect (wet grass), andinferred the most likely cause. This is called diagnostic, or "bottomup", reasoning, since it goesfrom effects to causes; it is a common task in expert systems.Bayes nets can also be used for causal, or "top down",reasoning. For example, we can compute the probability that the grasswill be wet given that it is cloudy.Hence Bayes nets are often called "generative" models, because theyspecify how causes generate effects.


One of the most exciting things about Bayes nets is that they can beused to put discussions about causality on a solid mathematical basis.One very interesting question is: can we distinguish causation from merecorrelation? The answer is "sometimes",but you need to measure the relationships between at leastthree variables; the intution is that one of the variables actsas a "virtual control" for the relationship between the other two,so we don"t always need to do experiments to infer causality.See the following books for details. "Causality: Models,Reasoning and Inference",Judea Pearl, 2000, Cambridge University Press. "Causation,Prediction and Search", Spirtes, Glymour and Scheines, 2001 (2nd edition), MIT Press. "Causeand Correlation in Biology", Bill Shipley, 2000,Cambridge University Press. "Computation, Causation and Discovery", Glymour and Cooper (eds),1999, MIT Press.

Conditional independence in Bayes Nets

In general,the conditional independence relationships encoded by a Bayes Netare best be explained by means of the "Bayes Ball"algorithm (due to Ross Shachter), which is as follows:Two (sets of) nodes A and B are conditionally independent(d-separated) given a set Cif and only if there is noway for a ball to get from A to B in the graph, where the allowablemovements of the ball are shown below.Hidden nodes are nodes whose valuesare not known, and are depicted as unshaded; observed nodes (the oneswe condition on) are shaded.The dotted indicate direction of flow of the ball.The most interesting case is the first column, when we have two arrows converging on anode X (so X is a "leaf" with two parents).If X is hidden, its parents are marginally independent, and hence theball does not pass through (the ball being "turned around" isindicated by the curved arrows); but if X is observed, the parents becomedependent, and the ball does pass through,because of the explaining away phenomenon.Notice that, if this graph was undirected, the childwould always separate the parents; hence when converting a directedgraph to an undirected graph, we must add links between "unmarried"parents who share a common child (i.e., "moralize" the graph) to prevent us reading off incorrectindependence statements.Now consider the second column in which we have two diverging arrows from X (so X is a"root").If X is hidden,the children are dependent, because they have a hidden common cause,so the ball passes through.If X is observed, its children are rendered conditionallyindependent, so the ball does not pass through.Finally, consider the case in which we have one incoming and outgoingarrow to X. It is intuitive that the nodes upstreamand downstream of X are dependent iff X is hidden, becauseconditioning on a node breaks the graph at that point.

Bayes nets with discrete and continuous nodes

The introductory example used nodes with categorical values and multinomial distributions.It is also possible to create Bayesian networks with continuous valued nodes.The most common distribution for such variables is the Gaussian.For discrete nodes with continuous parents, we can use thelogistic/softmax distribution.Using multinomials, conditional Gaussians, and the softmaxdistribution, we can have a rich toolbox for making complex models.Some examples are shown below. For details, clickhere. (Circles denote continuous-valued random variables,squares denote discrete rv"s, clearmeans hidden, and shaded means observed.)
For more details, see this excellent paper. A Unifying Review of Linear Gaussian Models,Sam Roweis & Zoubin Ghahramani.Neural Computation 11(2) (1999) pp.305-345

Temporal models

Dynamic Bayesian Networks (DBNs) are directed graphical models of stochasticprocesses.They generalise hidden Markov models (HMMs)and linear dynamical systems (LDSs)by representing the hidden (and observed) state in terms of statevariables, which can have complex interdependencies.The graphical structure provides an easy way to specify theseconditional independencies, and hence to provide a compactparameterization of the model.Note that "temporal Bayesian network" would be a better name than"dynamic Bayesian network", sinceit is assumed that the model structure does not change, butthe term DBN has become entrenched.We also normally assume that the parameters do notchange, i.e., the model is time-invariant.However, we can always add extrahidden nodes to represent the current "regime", thereby creatingmixtures of models to capture periodic non-stationarities.There are some cases where the size of the state space can change overtime, e.g., tracking a variable, but unknown, number of objects.In this case, we need to change the model structure over time.

Hidden Markov Models (HMMs)

The simplest kind of DBN is a Hidden Markov Model (HMM), which hasone discrete hidden node and one discrete or continuousobserved node per slice. We illustrate this below.As before, circles denote continuous nodes, squares denotediscrete nodes, clear means hidden, shaded means observed.
We have "unrolled" the model for 4 "time slices" -- the structure and parameters areassumed to repeat as the model is unrolled further.Hence to specify a DBN, we need todefine the intra-slice topology (within a slice),the inter-slice topology (between two slices),as well as the parameters for the first two slices.(Such a two-slice temporal Bayes net is often called a 2TBN.)Some common variants on HMMs are shown below.

Linear Dynamical Systems (LDSs) and Kalman filters

A Linear Dynamical System (LDS) has the same topology as an HMM, butall the nodes are assumed to have linear-Gaussian distributions, i.e., x(t+1) = A*x(t) + w(t), w ~ N(0, Q), x(0) ~ N(init_x, init_V) y(t) = C*x(t) + v(t), v ~ N(0, R)The Kalman filter is a way of doing online filtering in this model.Some simple variants of LDSs are shown below.
The Kalman filter has been proposed as a model for how the brainintegrates visual cues over time to infer the state of the world,although the reality is obviously much more complicated.The main point is not that the Kalman filter is the right model, but thatthe brain is combining bottom up and top down cues.The figure below is from a paper called"A Kalman Filter Model of the Visual Cortex",by P. Rao, Neural Computation 9(4):721--763, 1997.

More complex DBNs

It is also possible to create temporal models with much more complicatedtopologies, such as the Bayesian Automated Taxi (BAT) network shownbelow.(For simplicity, we only show the observed leaves for slice 2.Thanks to Daphne Koller for providing this figure.)
When some of the observed nodes are thought of as inputs (actions), and some asoutputs (percepts), the DBN becomes a POMDP.See also the section on decision theory below.

A generative model for generative models

The figure below, produced by Zoubin Ghahramani and Sam Roweis, is agood summary of the relationships between some popular graphical models.
INFERENCEA graphical model specifies a completejoint probability distribution (JPD) over all the variables.Given the JPD, we can answer all possible inferencequeries by marginalization (summing out over irrelevant variables), asillustrated in the introduction. However, the JPD has size O(2^n), where n is thenumber of nodes, and we have assumed each node can have 2states. Hence summing over the JPD takes exponential time.We now discuss more efficient methods.

Variable elimination

For a directed graphical model (Bayes net), we can sometimes use the factored representation of the JPDto do marginalisation efficiently.The key idea is to "push sums in" as far as possible when summing(marginalizing) out irrelevant terms, e.g., for the water sprinkler networkNotice that, as we perform the innermost sums, we create new terms,which need to be summed over in turn e.g.,whereContinuing in this way,whereThis algorithm is called Variable Elimination.The principle of distributing sums over products can be generalizedgreatly to apply to any commutative semiring.This forms the basis of many common algorithms, such as Viterbidecoding and the Fast Fourier Transform. For details, see R. McEliece and S. M. Aji, 2000.The Generalized Distributive Law,IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000),pp. 325--343. F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001.Factor graphs and the sum-product algorithmIEEE Transactions on Information Theory, February, 2001.The amount of work we perform when computing a marginal is bounded bythe size of the largest term that we encounter. Choosing a summation(elimination) ordering tominimize this is NP-hard, although greedy algorithms work well inpractice.

Xem thêm: Card Intel Hd Graphics Chơi Được Game Gì ? Đồ Họa Ra Sao?Acup

Dynamic programming

If we wish to compute several marginals at the same time, we can use DynamicProgramming (DP) to avoid the redundant computation that would be involvedif we used variable elimination repeatedly.If the underlying undirected graph of the BN is acyclic (i.e., a tree), we can use alocal message passing algorithm due to Pearl.This is a generalization of the well-known forwards-backwardsalgorithm for HMMs (chains).For details, see "Probabilistic Reasoning in Intelligent Systems", Judea Pearl,1988, 2nd ed. "Fusion and propogation with multiple observations in belief networks", Peot and Shachter, AI 48 (1991) p. 299-318.If the BN has undirected cycles (as in the water sprinkler example), local message passing algorithms run the risk of double counting.e.g., the information from S and R flowinginto W is not independent, because it came from a common cause, C.The most common approach is therefore to convert the BN into a tree,by clustering nodes together, to form what is called ajunction tree, and then running a local message passing algorithm onthis tree. The message passing scheme could be Pearl"s algorithm, butit is more common to use a variant designed for undirected models.For more details, click hereThe running time of the DP algorithms is exponential in the size ofthe largest cluster (these clusters correspond to the intermediateterms created by variable elimination). This size is called theinduced width of the graph. Minimizing this is NP-hard.

Approximation algorithms

Many models of interest,such as those with repetitive structure, as inmultivariate time-series or image analysis,have large induced width, which makes exactinference very slow.We must therefore resort to approximation techniques.Unfortunately, approximate inference is #P-hard, but we can nonetheless come upwith approximations which often work well in practice. Below is a listof the major techniques. Variational methods.The simplest example is the mean-field approximation,which exploits the law oflarge numbers to approximate large sums of random variables by theirmeans. In particular, we essentially decouple all the nodes, andintroduce a new parameter, called a variational parameter, for eachnode, and iteratively update these parameters so as to minimize thecross-entropy (KL distance) between the approximate and trueprobability distributions. Updating the variational parameters becomes a proxy forinference. The mean-field approximation produces a lower bound on thelikelihood. More sophisticated methods are possible, which givetighter lower (and upper) bounds. Sampling (Monte Carlo) methods. The simplest kind is importancesampling, where we draw random samples x from P(X), the (unconditional)distribution on the hidden variables, andthen weight the samples by their likelihood, P(y|x), where y is theevidence. A more efficient approach in high dimensions is called MonteCarlo Markov Chain (MCMC), and includes as special cases Gibbs sampling and the Metropolis-Hasting algorithm. "Loopy belief propogation". This entailsapplying Pearl"s algorithm to the originalgraph, even if it has loops (undirected cycles).In theory, this runs the risk of double counting, but Yair Weiss andothers have proved that in certain cases (e.g., a single loop), events are double counted"equally", and hence "cancel" to give the right answer.Belief propagation is equivalent to exact inference on a modifiedgraph, called the universal cover or unwrapped/ computation tree,which has the same local topology as the original graph.This is the same as the Bethe and cavity/TAP approaches in there is a deep connection betweenbelief propagation and variational methods that people are currently investigating. Bounded cutset conditioning. By instantiating subsets of the variables,we can break loops in the graph.Unfortunately, when the cutset is large, this is very slow.By instantiating only a subset of values of the cutset, we can computelower bounds on the probabilities of interest.Alternatively, we can sample the cutsets jointly, a technique known as block Gibbs sampling. Parametric approximation methods.These express the intermediate summands in a simplerform, e.g., by approximating them as a product of smaller factors."Minibuckets" and the Boyen-Koller algorithm fall into this category.Approximate inference is a huge topic:see the references for more details.

Inference in DBNs

The general inference problem for DBNs is to computeP(X(i,t0) | y(:, t1:t2)), where X(i,t) represents the i"th hiddenvariable at time and t Y(:,t1:t2) represents all the evidencebetween times t1 and t2. (In fact, we often also want to compute joint distributions ofvariables over one or more time slcies.)There are several special cases of interest, illustrated below.The arrow indicates t0: it is X(t0) that we are trying to estimate.The shaded region denotes t1:t2, the available data.
Here is a simple example of inference in an LDS.Consider a particle moving in the plane atconstant velocity subject to random perturbations in its trajectory.The new position (x1, x2) is the old position plus the velocity (dx1,dx2) plus noise w.< x1(t) > = <1 0 1 0> < x1(t-1) > + < wx1 >< x2(t) > <0 1 0 1> < x2(t-1) > < wx2 >< dx1(t) > <0 0 1 0> < dx1(t-1) > < wdx1 >< dx2(t) > <0 0 0 1> < dx2(t-1) > < wdx2 >We assume we only observe the position of the particle.< y1(t) > = <1 0 0 0> < x1(t) > + < vx1 >< y2(t) > <0 1 0 0> < x2(t) > < vx2 > < dx1(t) > < dx2(t) >Suppose we start out at position (10,10) moving to the right withvelocity (1,0).We sampled a random trajectory of length 15.Below we show the filtered and smoothed trajectories.
The mean squared error of the filtered estimate is 4.9; for thesmoothed estimate it is 3.2.Not only is the smoothed estimate better, but we know thatit is better, as illustrated by the smaller uncertainty ellipses;this can help in e.g., data association problems.Note how the smoothed ellipses are larger at the ends, because thesepoints have seen less data. Also, note how rapidly the filteredellipses reach their steady-state (Ricatti) values.(See my Kalmanfilter toolbox for more details.)LEARNINGOne needs to specify two things to describe a BN: the graph topology(structure) and the parameters of each CPD. It is possible to learnboth of these from data. However, learning structure is much harderthan learning parameters. Also, learning when some of the nodes arehidden, or we have missing data, is much harder than when everythingis observed. This gives rise to 4 cases:Structure Observability Method---------------------------------------------Known Full Maximum Likelihood EstimationKnown Partial EM (or gradient ascent)Unknown Full Search through model space Unknown Partial EM + search through model space

Known structure, full observability

We assume that the goal of learning in this case is to find the valuesof the parameters of each CPD which maximizes the likelihood of thetraining data,which contains N cases (assumed to be independent).The normalized log-likelihood of the training set Dis a sum of terms, onefor each node:
We see that the log-likelihood scoring function decomposesaccording to the structure of the graph, and hence we can maximize thecontribution to the log-likelihood of each node independently(assuming the parameters in each node are independent of the other nodes).Consider estimating the Conditional Probability Table for the Wnode. If we have a set of training data, we can just count the numberof times the grass is wet when it is raining and the sprinler is on,N(W=1,S=1,R=1), the number of times the grass is wet when it israining and the sprinkler is off, N(W=1,S=0,R=1), etc. Given thesecounts (which are the sufficient, we can find the MaximumLikelihood Estimate of the CPT as follows:where the denominator is N(S=s,R=r) = N(W=0,S=s,R=r) + N(W=1,S=s,R=r).Thus "learning" just amounts to counting (in the case of multinomialdistributions).For Gaussian nodes, we can compute the sample mean and variance, anduse linear regression to estimate the weight matrix.For other kinds of distributions, more complex procedures arenecessary.As is well known from the HMM literature, ML estimates of CPTs areprone to sparse data problems, which can be solved by using (mixturesof) Dirichlet priors (pseudo counts).This results in a Maximum A Posteriori (MAP) estimate.For Gaussians, we can use a Wishart prior, etc.

Known structure, partial observability

When some of the nodes are hidden, we can use the EM (ExpectationMaximization) algorithm to find a (locally) optimal Maximum LikelihoodEstimate of the parameters.The basic idea behind EM is that, if we knew the values ofall the nodes, learning (the M step) would be easy, as we sawabove. So in the E step, we compute the expected values of all thenodes using an inference algorithm, and then treat these expectedvalues as though they were observed (distributions). For example, in the caseof the W node, we replace the observed counts of the events with the number of timeswe expect to see each event:P(W=w|S=s,R=r) = E N(W=w,S=s,R=r) / E N(S=s,R=r)whereE N(x) is the expected number of times event xoccurs in the whole training set, given the current guess of the parameters.These expected counts can be computed as followsE N(.) = E sum_k I(. | D(k)) = sum_k P(. | D(k))where I(x | D(k)) is an indicator function which is 1 if eventx occurs in training case k, and 0 otherwise.Given the expected counts, we maximize the parameters, and thenrecompute the expected counts, etc. This iterative procedure isguaranteed to converge to a local maximum of the likelihood surface.It is also possible to do gradient ascent on the likelihood surface (thegradient expression also involves the expected counts), but EM isusually faster (since it uses the natural gradient) and simpler (sinceit has no step size parameter andtakes care of parameter constraints (e.g., the "rows" of theCPT having to sum to one) automatically).In any case, we see than when nodes are hidden, inference becomes asubroutine which is called by the learning procedure; hence fastinference algorithms are crucial.

Unknown structure, full observability

We start by discussing the scoring function which we use to selectmodels; we then discuss algorithms which attempt to optimize thisfunction over the space of models, and finally examine their computational andsample complexity.

The objective function used for model selection

The maximum likelihood model will be acomplete graph, since this has the largest number of parameters, and hence can fit the data the best.A well-principled way to avoid this kind of over-fitting is to put a prior on models,specifying that we prefer sparse models.Then, by Bayes" rule, the MAP model is the one that maximizes
Taking logs, we find
where c = - \log \Pr(D) is a constant independent of G.The effect of the structure prior P(G) is equivalent to penalizing overly complexmodels.However, this is not strictly necessary, since the marginal likelihoodtermP(D|G) = \int_{\theta} P(D|G, \theta)has a similar effect of penalizing models with too many parameters(this is known as Occam"s razor).

Search algorithms for finding the best model

The goal of structure learning is to learn a dag (directed acyclicgraph) that best explains the data. This is an NP-hard problem, sincethe number of dag"s on N variables is super-exponential in N. (There is no closedform formula for this, but to give you an idea,there are 543 dags on 4 nodes, and O(10^18) dags on 10 nodes.)If we know the ordering of the nodes, life becomes much simpler,since we can learn the parent set foreach node independently(since the score is decomposable), and we don"t need to worry about acyclicity constraints.For each node, there at most\sum_{k=0}^n \choice{n}{k} = 2^nsets of possible parents for each node, which can bearranged in a lattice as shown below for n=4.The problem is to find the highest scoring point in this lattice.
There are three obvious ways to search this graph: bottom up, topdown, or middle out.In the bottom up approach, we start at the bottom of thelattice, and evaluate the score at all points in each successivelevel.We must decide whether the gains in score produced by alarger parent set is ``worth it"".The standard approach in the reconstructibility analysis (RA) community uses the factthat \chi^2(X,Y) \approx I(X,Y) N \ln(4), where N is the number ofsamples and I(X,Y) is the mutual information (MI) between X and Y.Hence we can use a \chi^2 test to decidewhether an increase in the MI score is statistically significant.(This also gives us some kind of confidence measure on the connectionsthat we learn.)Alternatively, we can use a BIC score.Of course, if we do not know if we have achieved the maximum possiblescore, we do not know when to stop searching, and hence we mustevaluate all points in the lattice (although we can obviously usebranch-and-bound). For large n, this iscomputationally infeasible, so a common approach is to only search upuntil level K (i.e., assume a bound on the maximum number of parentsof each node), which takes O(n ^ K) time.The obvious way to avoid the exponential cost (and the need for abound, K) is to use toavoid examining all possible subsets.(In fact, we must use of some kind, since the problem oflearning optimal structure is NP-hard \cite{Chickering95}.)One approach in the RA framework, called Extended Dependency Analysis (EDA)\cite{Conant88}, is as follows.Start by evaluating all subsets of size up to two, keep all the ones withsignificant (in the \chi^2 sense) MI with the target node, and take the union of the resulting setas the set of parents.The disadvantage of this greedy technique is that it will fail to find a setof parents unless some subset of size two has significant MI with thetarget variable. However, a Monte Carlosimulation in \cite{Conant88} shows that most random relations havethis property.In addition, highly interdependent sets of parents (which mightfail the pairwise MI test) violate the causalindependence assumption, which is necessary to justify the use ofnoisy-OR and similar CPDs.An alternative technique, popular in the UAI community, is to startwith an initial guess of the model structure (i.e., at a specificpoint in the lattice), and then perform localsearch, i.e., evaluate the score of neighboring points in the lattice,and move to the best such point, until we reach a local optimum. Wecan use multiple restarts to try to find the global optimum, and tolearn an ensemble of models.Note that, in the partially observable case, we need to have aninitial guess of the model structure in order to estimate the valuesof the hidden nodes, and hence the (expected) score of each model; starting with the fullydisconnected model (i.e., at the bottom of the lattice) would be a badidea, since it would lead to a poor estimate.

Unknown structure, partial observability

Finally, we come to the hardest case of all, where the structure isunknown and there are hidden variables and/or missing data.In this case, to compute the Bayesian score, we must marginalize outthe hidden nodes as well as the parameters.Since this is usually intractable, it is common to usean asymptoticapproximation to the posterior called BIC (Bayesian InformationCriterion), which is defined as follows:\log \Pr(D|G) \approx \log \Pr(D|G, \hat{\Theta}_G) - \frac{\log N}{2} \#Gwhere N is the number of samples,\hat{\Theta}_G is the ML estimate of the parameters,and#G is the dimension of the model.(In the fully observable case, the dimension of a model is the numberof free parameters. In a model with hidden variables, it might be lessthan this.)The first term is just the likelihood andthe second term is a penalty for model complexity.(The BIC score is identical to the Minimum Description Length (MDL)score.)Although the BIC score decomposes into a sum of local terms, one pernode,local search is still expensive, because we need to run EM at eachstep to compute \hat{\Theta}. An alternative approach is to do thelocal search steps inside of the M step of EM - this is calledStructureal EM, and provably converges to a local maximum of the BIonaga.vncore (Friedman, 1997).

Inventing new hidden nodes

So far, structure learning has meant finding the right connectivitybetween pre-existing nodes. A more interesting problem is inventinghidden nodes on demand. Hidden nodes can make a model much morecompact, as we see below.
(a) A BN with a hidden variable H. (b) The simplest networkthat can capture the same distribution without using a hidden variable(created using arc reversal and node elimination).If H is binary and the other nodes are trinary, and we assume fullCPTs, the first network has 45 independent parameters, and the secondhas 708.The standard approach is to keep addinghidden nodes one at a time, to some part of the network (see below),performing structure learning at eachstep, until the score drops.One problem is choosing the cardinality (number of possible values)for the hidden node, and its type of CPD.Another problem ischoosing where to add the new hidden node.There is no point making it a child, since hidden children can alwaysbe marginalized away, so we need to find an existing node which needsa new parent, when the current set of possible parents is not adequate.\cite{Ramachandran98} use the following heuristic for finding nodeswhich need new parents: they considera noisy-OR node which is nearly always on, even if its non-leakparents are off, as anindicator that there is a missing parent.Generalizing this technique beyond noisy-ORs is an interesting openproblem.One approach might be to examineH(X|Pa(X)):if this is very high, it means the current set of parents areinadequate to ``explain"" the residual entropy; if Pa(X) is thebest (in the BIC or \chi^2 sense) set of parents we have been ableto find in the current model, itsuggests we need to create a new node and add it to Pa(X).A simple heuristic for inventing hidden nodes in the case of DBNs isto check if the Markov property is being violated for any particularnode. If so, it suggests that we need connections to slices furtherback in time. Equivalently, we can add new lag variablesand connect to them.Of course,interpreting the ``meaning"" of hidden nodes is alwaystricky, especially since they are often unidentifiable,e.g., we can often switch the interpretation of the true and false states(assuming for simplicity that the hidden node is binary) provided wealso permute the parameters appropriately. (Symmetries such as this areone cause of the multiple maxima in the likelihood surface.)

Further reading on learning

The following are good tutorial articles. W. L. Buntine, 1994."Operations for Learning with Graphical Models",J. AI Research, 159--225. D. Heckerman, 1996."A tutorial on learning with Bayesian networks",Microsoft Research tech. report, MSR-TR-95-06.Decision TheoryIt is often said that "Decision Theory = Probability Theory + UtilityTheory".We have outlined above how we can model joint probability distributions in acompact way by using sparse graphs to reflect conditional independencerelationships.It is also possible to decompose multi-attribute utility functions ina similar way:we create a node for each term in the sum, whichhas as parents all the attributes (randomvariables) on which it depends; typically, the utility node(s) willhave action node(s) as parents, since the utility depends both on thestate of the world and the action we perform.The resulting graph is called an influence diagram.In principle, we can then use the influence diagram to computethe optimal (sequence of) action(s) to perform so as to maximimizeexpected utility, although this is computationally intractible for allbut the smallest problems.Classical control theory is mostly concerned with the special casewhere the graphical model is aLinear Dynamical Systemand the utility function is negative quadratic loss, e.g., consider amissile tracking an airplane: its goal is to minimize the squareddistance between itself and the target. When the utility functionand/or the system model becomes more complicated, traditional methodsbreak down, and one has to use reinforcement learning to find theoptimal policy (mapping from states to actions).ApplicationsThe most widely used Bayes Nets are undoubtedly the ones embedded inMicrosoft"s products, including the AnswerWizard of Office 95, the Office Assistant (the bouncy paperclip guy) ofOffice 97, and over 30 Technical Support Troubleshooters.BNs originally arose out of an attempt to add probabilities toexpert systems, and this is still the most common use for BNs.A famous example isQMR-DT, a decision-theoretic reformulation of the Quick MedicalReference (QMR) model.Here, the top layer represents hidden disease nodes, and the bottomlayer represents observed symptom nodes.The goal is to infer the posterior probability of each disease givenall the symptoms (which can be present, absent or unknown).QMR-DT is so densely connected that exact inference is impossible. Various approximationmethods have been used, including sampling, variational and loopybelief propagation.Another interesting fielded application is theVista system, developedby Eric Horvitz.The Vista system is a decision-theoretic system that has been used atNASA Mission Control Center in Houston for several years. The systemuses Bayesian networks to interpret live telemetry and provides adviceon the likelihood of alternative failures of the space shuttle"spropulsion systems. It also considers time criticality and recommendsactions of the highest expected utility. The Vista system also employsdecision-theoretic methods for controlling the display of informationto dynamically identify the most important information to highlight.Horvitz has gone on to attempt to apply similar technology toMicrosoft products, e.g., the Lumiere project.Special cases of BNs were independently invented by many differentcommunities, for use in e.g., (linkage analysis), speechrecognition (HMMs), tracking (Kalman fitering), data compression(density estimation)and coding (turbocodes), etc.For examples of other applications, see thespecial issue of Proc. ACM 38(3), 1995,and the MicrosoftDecision Theory Group page.

Applications to biology

This is one of the hottest areas.For a review, seeInferring cellular networks using probabilistic graphical modelsScience, Nir Friedman, v303 p799, 6 Feb 2004.Recommended introductory reading


In reverse chronological order. Daphne Koller and Nir Friedman,"Probabilistic graphical models: principles and techniques",MIT Press 2009 Adnan Darwiche,"Modeling and reasoning with Bayesian networks",Cambridge 2009F. V. Jensen."Bayesian Networks and Decision Graphs".Springer.2001.Probably the best introductory book available.D. Edwards."Introduction to Graphical Modelling", 2nd ed.Springer-Verlag.2000.Good treatment of undirected graphical models from a statistical perspective. J. Pearl."Causality".Cambridge.2000.The definitive book on using causal DAG modeling. R. G. Cowell, A. P. Dawid, S. L. Lauritzen andD. J. Spiegelhalter."Probabilistic Networks and Expert Systems".Springer-Verlag.1999.Probably the best book available,although the treatment is restricted to exact inference. M. I. Jordan (ed)."Learning in Graphical Models".MIT Press.1998.Loose collection of papers on machine learning, many related tographical models.One of the few books to discuss approximate inference. B. Frey."Graphical models for machine learning and digital communication",MIT Press.1998.Discusses pattern recognition and turbocodes using (directed)graphical models. E. Castillo and J. M. Gutierrez and A. S. Hadi."Expert systems and probabilistic network models".Springer-Verlag, 1997.A Spanishversion is available online for free. F. Jensen."An introduction to Bayesian Networks".UCL Press.1996.Out of print.Superceded by his 2001 book. S. Lauritzen."Graphical Models",Oxford.1996.The definitive mathematical exposition of the theory of graphicalmodels. S. Russell and P. Norvig."Artificial Intelligence: A Modern Approach".Prentice Hall.1995. Popular undergraduate textbook that includes a readable chapter ondirected graphical models. J. Whittaker."Graphical Models in Applied Multivariate",Wiley.1990.This is the first book published on graphical modelling from a statistionaga.vnperspective. R. Neapoliton."Probabilistic Reasoning in Expert Systems".John Wiley & Sons.1990. J. Pearl."Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference."Morgan Kaufmann.1988. The book that got it all started!A very insightful book, still relevant today.

Review articles

P. Smyth, 1998."Belief networks, hidden Markov models, and Markovrandom fields: a unifying view",Pattern Recognition Letters. E. Charniak, 1991."Bayesian Networks without Tears", AI magazine.Sam Roweis & Zoubin Ghahramani, 1999.A Unifying Review of Linear Gaussian Models,Neural Computation 11(2) (1999) pp.305-345

Exact Inference

C. Huang and A. Darwiche, 1996."Inference in Belief Networks: A procedural guide",Intl. J. Approximate Reasoning, 15(3):225-263. R. McEliece and S. M. Aji, 2000.The Generalized Distributive Law,IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000),pp. 325--343. F. Kschischang, B. Frey and H. Loeliger, 2001.Factorgraphs and the sum product algorithm,IEEE Transactions on Information Theory, February, 2001. M. Peot and R. Shachter, 1991."Fusion and propogation with multiple observations in belief networks",Artificial Intelligence, 48:299-318.

Approximate Inference

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, 1997."An introduction to variational methods for graphical models." D. MacKay, 1998."An introduction to Monte Carlo methods". T. Jaakkola and M. Jordan, 1998."Variational probabilistic inference and the QMR-DT database"


W. L. Buntine, 1994."Operations for Learning with Graphical Models",J. AI Research, 159--225. D. Heckerman, 1996."A tutorial on learning with Bayesian networks",Microsoft Research tech. report, MSR-TR-95-06.

Xem thêm: Đau Các Khớp Ngón Tay Là Bệnh Gì ? Xem Ngay! Các Nguyên Nhân Gây Ra Đau Khớp Ngón Tay


L. R. Rabiner, 1989."A Tutorial in Hidden Markov Models and Selected Applications in Speech Recognition",Proc. of the IEEE, 77(2):257--286. Z. Ghahramani, 1998. Learning Dynamic Bayesian Networks In C.L. Giles and M. Gori (eds.), Adaptive Processing of Sequences and Data Structures . Lecture Notes in Artificial Intelligence, 168-197. Berlin: Springer-Verlag.