# BAYESIAN NETWORK LÀ GÌ

-
By Kevin Murphy, 1998.

Bạn đang xem: Bayesian network là gì

"Graphical models are a marriage between probability theory andgraph theory. They provide a natural tool for dealing with two problemsthat occur throughout applied mathemationaga.vn and engineering --uncertainty và complexity -- & in particular they are playing anincreasingly important role in the design & analysis of machinelearning algorithms. Fundamental to lớn the idea of a graphical Mã Sản Phẩm isthe notion of modularity -- a complex system is built by combiningsimpler parts. Probability theory provides the glue whereby the partsare combined, ensuring that the system as a whole is consistent, andproviding ways to interface models to data. The graph theoretic sideof graphical models provides both an intuitively appealing interfaceby which humans can Mã Sản Phẩm highly-interacting sets of variables as wellas a data structure that lends itself naturally to the kiến thiết ofefficient general-purpose algorithms.Many of the classical multivariate probabalistic systems studied infields such as statistionaga.vn, systems engineering, information theory,pattern recognition & statistical mechanionaga.vn are special cases of thegeneral graphical model formalism -- examples include mixture models,factor analysis, hidden Markov models, Kalman filters and Isingmodels. The graphical Model framework provides a way to lớn view all ofthese systems as instances of a comtháng underlying formalism. This viewhas many advantages -- in particular, specialized techniques that havebeen developed in one field can be transferred between researchcommunities & exploited more widely. Moreover, the graphical modelformalism provides a natural framework for the thiết kế of new systems."--- Michael Jordan, 1998.

## This tutorial

We will briefly discuss the following topionaga.vn. Representation, or, what exactly is agraphical model? Inference, or, how can we use these modelsto efficiently answer probabilistic queries? Learning, or, what bởi we do if we don"t know what theModel is? Decision theory, or, what happens when itis time to lớn convert beliefs inkhổng lồ actions? Applications, or, what"s this all good for, anyway?Note: (a version of) this page is availablein pdf formathere.Also, Marie Stefanova has madea Swedish translationhere.

## Other sources of technical information

My tutorial on Bayes rule AUAI homepage (Association for Uncertainty in Artificial Intelligence) The UAImailing menu UAI proceedings. My danh sách of recommended reading . Bayes Net softwarepackages My Bayes Net Toolbox forMatlab Tutorial slides on graphical models& BNT, presented khổng lồ the Mathworks, May 2003 Listof other Bayes net tutorials RepresentationProbabilistic graphical models are graphs in which nodes representrandom variables, và the (lachồng of) aronaga.vn represent conditionalindependence assumptions.Hence they provide a compact representation of joint probabilitydistributions.Undirected graphical models, also called MarkovRandom Fields (MRFs) or Markov networks,have a simple definition of independence: two (sets of) nodes A và B are conditionally independent given a third set, C,if all paths between the nodes in A and B are separated by a node in C.By contrast, directed graphical modelsalso called Bayesian Networks or Belief Networks (BNs), have a morecomplicated notion of independence,which takes inkhổng lồ account the directionality of thearonaga.vn, as we explain below.Undirected graphical models are more popular with the physionaga.vn andvision communities, & directed models are more popular with the AIvà statistionaga.vn communities. (It is possible to have a mã sản phẩm with bothdirected and undirected aronaga.vn, which is called a chain graph.)For a careful study of the relationship between directed andundirected graphical models, see the books by Pearl88, Whittaker90,and Lauritzen96.Although directed models have a more complicated notion ofindependence than undirected models,they do have sầu several advantages.The most important is thatone can regard an arc from A to lớn B asindicating that A ``causes"" B. (See the discussion on causality.)This can be used as a guide to lớn construct the graph structure.In addition, directed models can encode deterministicrelationships, và are easier to learn (fit to lớn data).In the rest of this tutorial, we will only discuss directed graphicalmodels, i.e., Bayesian networks.In addition to the graph structure, it is necessary to specify theparameters of the Mã Sản Phẩm.For a directed mã sản phẩm, we must specifythe Conditional Probability Distribution (CPD) at each node.If the variables are discrete, this can be represented as a table(CPT), which lists the probability that the child node takes on eachof its different values for each combination of values of itsparents. Consider the following example, in which all nodes are binary,i.e., have sầu two possible values, which we will denote by T (true) andF (false).We see that the sự kiện "grass is wet" (W=true) has two possible causes: either the water sprinker is on (S=true) or it israining (R=true).The strength of this relationship is shown in the table.For example, we see that Pr(W=true | S=true, R=false) = 0.9 (secondrow), andhence, Pr(W=false | S=true, R=false) = 1 - 0.9 = 0.1, since each rowmust sum khổng lồ one.Since the C node has no parents, its CPT specifies the priorprobability that it is cloudy (in this case, 0.5).(Think of C as representing the season:if it is a cloudy season, it is less likely that the sprinkler is onvà more likely that the rain is on.)The simplest conditional independence relationship encoded in a Bayesiannetwork can be stated as follows:a node is independent of its ancestors given its parents, where theancestor/parent relationship is with respect lớn some fixed topologicalordering of the nodes.By the chain rule of probability,the joint probability of all the nodes in the graph above sầu isP(C, S, R, W) = P(C) * P(S|C) * P(R|C,S) * P(W|C,S,R)By using conditional independence relationships, we can rewrite this asP(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)where we were allowed to simplify the third term because R isindependent of S given its parent C, and the last term because W isindependent of C given its parents S và R.We can see that the conditional independence relationshipsallow us lớn represent the joint more compactly.Here the savings are minimal, but in general, if we had n binarynodes, the full joint would require O(2^n) space khổng lồ represent, but thefactored khung would require O(n 2^k) space to represent, where k isthe maximum fan-in of a node. And fewer parameters makes learning easier.

### Are "Bayesian networks" Bayesian?

Despite the name,Bayesian networks vày not necessarily imply a commitment khổng lồ Bayesianstatistionaga.vn. Indeed, it is common khổng lồ use frequentists methods lớn estimate the parameters of the CPDs.Rather, they are so called because they useBayes" rule forprobabilistic inference, as we explain below.(The term "directed graphical model" is perhaps more appropriate.)Nevetherless, Bayes nets are a useful representation for hierarchicalBayesian models, which khung the foundation of applied Bayesianstatistionaga.vn(see e.g., the BUGS project).In such a Model, the parameters are treated lượt thích any other randomvariable, & becomes nodes in the graph.

## Inference

The most comtháng task we wish lớn solve sầu using Bayesian networks isprobabilistic inference. For example, consider the water sprinklernetwork, and suppose we observe thefact that the grass is wet. There are two possible causes for this:either it is raining, or the sprinkler is on. Which is more likely?We can use Bayes" rule khổng lồ compute the posterior probability of eachexplanation (where 0==false và 1==true).whereis a normalizing constant, equal khổng lồ the probability (likelihood) ofthe data.So we see that it is more likely that the grass is wet becauseit is raining:the likelihood ratio is 0.7079/0.4298 = 1.647.

## Explaining away

In the above sầu example, notice that the two causes "compete" khổng lồ "explain" theobserved data. Hence S & R become conditionally dependent given thattheir common child, W, is observed, even though they are marginallyindependent. For example,suppose the grass is wet, but that we also know that it is raining.Then the posterior probability that the sprinkler is on goes down:Pr(S=1|W=1,R=1) = 0.1945This is called "explaining away".In statistionaga.vn, this is known as Berkson"s paradox, or "selectionbias". For a dramatic example of this effect, consider a college whichadmits students who are either brainy or sporty (or both!).Let C denote the sự kiện that someone is admitted to college, which ismade true if they are either brainy (B) or sporty (S).Suppose in the general population, B & S are independent.We can mã sản phẩm our conditional independence assumptions using a graphwhich is a V structure, with arrows pointing down: B S / v CNow look at a population of college students (those for which C isobserved to lớn be true).It will be found that being brainy makes you less likely khổng lồ be sportyvà vice versa, because either property alone is sufficient lớn explainthe evidence on C(i.e., P(S=1 | C=1, B=1) try this little BNT demo!)

## Top-down and bottom-up reasoning

In the water sprinkler example, we had evidence of an effect (wet grass), andinferred the most likely cause. This is called diagnostic, or "bottomup", reasoning, since it goesfrom effects to lớn causes; it is a common task in expert systems.Bayes nets can also be used for causal, or "top down",reasoning. For example, we can compute the probability that the grasswill be wet given that it is cloudy.Hence Bayes nets are often called "generative" models, because theyspecify how causes generate effects.

## Causality

One of the most exciting things about Bayes nets is that they can beused to put discussions about causality on a solid mathematical basis.One very interesting question is: can we distinguish causation from merecorrelation? The answer is "sometimes",but you need to lớn measure the relationships between at leastthree variables; the intution is that one of the variables actsas a "virtual control" for the relationship between the other two,so we don"t always need lớn do experiments lớn infer causality.See the following books for details. "Causality: Models,Reasoning and Inference",Judea Pearl, 2000, Cambridge University Press. "Causation,Prediction và Search", Spirtes, Glymour và Scheines, 2001 (2nd edition), MIT Press. "Causevà Correlation in Biology", Bill Shipley, 2000,Cambridge University Press. "Computation, Causation and Discovery", Glymour & Cooper (eds),1999, MIT Press.

## Conditional independence in Bayes Nets

In general,the conditional independence relationships encoded by a Bayes Netare best be explained by means of the "Bayes Ball"algorithm (due to Ross Shachter), which is as follows:Two (sets of) nodes A và B are conditionally independent(d-separated) given a set Cif and only if there is noway for a ball lớn get from A to B in the graph, where the allowablemovements of the ball are shown below.Hidden nodes are nodes whose valuesare not known, và are depicted as unshaded; observed nodes (the oneswe condition on) are shaded.The dotted aronaga.vn indicate direction of flow of the ball.The most interesting case is the first column, when we have two arrows converging on anode X (so X is a "leaf" with two parents).If X is hidden, its parents are marginally independent, và hence theball does not pass through (the ball being "turned around" isindicated by the curved arrows); but if X is observed, the parents becomedependent, & the ball does pass through,because of the explaining away phenomenon.Notice that, if this graph was undirected, the childwould always separate the parents; hence when converting a directedgraph to an undirected graph, we must add liên kết between "unmarried"parents who mô tả a comtháng child (i.e., "moralize" the graph) khổng lồ prevent us reading off incorrectindependence statements.Now consider the second column in which we have two diverging arrows from X (so X is a"root").If X is hidden,the children are dependent, because they have sầu a hidden common cause,so the ball passes through.If X is observed, its children are rendered conditionallyindependent, so the ball does not pass through.Finally, consider the case in which we have one incoming và outgoingarrow to lớn X. It is intuitive sầu that the nodes upstreamvà downstream of X are dependent iff X is hidden, becauseconditioning on a node breaks the graph at that point.

## Bayes nets with discrete và continuous nodes

The introductory example used nodes with categorical values & multinomial distributions.It is also possible khổng lồ create Bayesian networks with continuous valued nodes.The most comtháng distribution for such variables is the Gaussian.For discrete nodes with continuous parents, we can use thelogistic/softmax distribution.Using multinomials, conditional Gaussians, & the softmaxdistribution, we can have sầu a rich toolbox for making complex models.Some examples are shown below. For details, clickhere. (Circles denote continuous-valued random variables,squares denote discrete rv"s, clearmeans hidden, and shaded means observed.)
For more details, see this excellent paper. A Unifying reviews of Linear Gaussian Models,Sam Roweis và Zoubin Ghahramani.Neural Computation 11(2) (1999) pp.305-345

## Temporal models

Dynamic Bayesian Networks (DBNs) are directed graphical models of stochasticprocesses.They generalise hidden Markov models (HMMs)& linear dynamical systems (LDSs)by representing the hidden (& observed) state in terms of statevariables, which can have complex interdependencies.The graphical structure provides an easy way khổng lồ specify theseconditional independencies, & hence khổng lồ provide a compactparameterization of the model.lưu ý that "temporal Bayesian network" would be a better name than"dynamic Bayesian network", sinceit is assumed that the mã sản phẩm structure does not change, butthe term DBN has become entrenched.We also normally assume that the parameters do notchange, i.e., the Mã Sản Phẩm is time-invariant.However, we can always add extrahidden nodes to represent the current "regime", thereby creatingmixtures of models to capture periodic non-stationarities.There are some cases where the size of the state space can change overtime, e.g., tracking a variable, but unknown, number of objects.In this case, we need lớn change the mã sản phẩm structure over time.

### Hidden Markov Models (HMMs)

The simplest kind of DBN is a Hidden Markov Model (HMM), which hasone discrete hidden node and one discrete or continuousobserved node per slice. We illustrate this below.As before, circles denote continuous nodes, squares denotediscrete nodes, clear means hidden, shaded means observed.
We have sầu "unrolled" the Mã Sản Phẩm for 4 "time slices" -- the structure & parameters areassumed khổng lồ repeat as the Model is unrolled further.Hence khổng lồ specify a DBN, we need todefine the intra-slice topology (within a slice),the inter-slice topology (between two slices),as well as the parameters for the first two slices.(Such a two-slice temporal Bayes net is often called a 2TBN.)Some comtháng variants on HMMs are shown below.

### Linear Dynamical Systems (LDSs) & Kalman filters

A Linear Dynamical System (LDS) has the same topology as an HMM, butall the nodes are assumed lớn have linear-Gaussian distributions, i.e., x(t+1) = A*x(t) + w(t), w ~ N(0, Q), x(0) ~ N(init_x, init_V) y(t) = C*x(t) + v(t), v ~ N(0, R)The Kalman filter is a way of doing online filtering in this Mã Sản Phẩm.Some simple variants of LDSs are shown below.
The Kalman filter has been proposed as a mã sản phẩm for how the brainintegrates visual cues over time khổng lồ infer the state of the world,although the reality is obviously much more complicated.The main point is not that the Kalman filter is the right Mã Sản Phẩm, but thatthe brain is combining bottom up and top down cues.The figure below is from a paper called"A Kalman Filter Model of the Visual Cortex",by P. Rao, Neural Computation 9(4):721--763, 1997.

### More complex DBNs

It is also possible khổng lồ create temporal models with much more complicatedtopologies, such as the Bayesian Automated Taxi (BAT) network shownbelow.(For simplithành phố, we only show the observed leaves for slice 2.Thanks khổng lồ Daphne Koller for providing this figure.)
When some of the observed nodes are thought of as inputs (actions), and some asoutputs (percepts), the DBN becomes a POMDP.See also the section on decision theory below.

## A generative model for generative models

The figure below, produced by Zoubin Ghahramani và Sam Roweis, is agood summary of the relationships between some popular graphical models.
INFERENCEA graphical model specifies a completejoint probability distribution (JPD) over all the variables.Given the JPD, we can answer all possible inferencequeries by marginalization (summing out over irrelevant variables), asillustrated in the introduction. However, the JPD has kích thước O(2^n), where n is thenumber of nodes, và we have assumed each node can have sầu 2states. Hence summing over the JPD takes exponential time.We now discuss more efficient methods.

## Variable elimination

For a directed graphical Mã Sản Phẩm (Bayes net), we can sometimes use the factored representation of the JPDlớn bởi vì marginalisation efficiently.The key idea is to "push sums in" as far as possible when summing(marginalizing) out irrelevant terms, e.g., for the water sprinkler networkNotice that, as we perform the innermost sums, we create new terms,which need khổng lồ be summed over in turn e.g.,whereContinuing in this way,whereThis algorithm is called Variable Elimination.The principle of distributing sums over products can be generalizedgreatly lớn apply khổng lồ any commutative semiring.This forms the basis of many common algorithms, such as Viterbidecoding và the Fast Fourier Transkhung. For details, see R. McEliece & S. M. Aji, 2000.The Generalized Distributive Law,IEEE Trans. Inkhung. Theory, vol. 46, no. 2 (March 2000),pp. 325--343. F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001.Factor graphs và the sum-hàng hóa algorithmIEEE Transactions on Information Theory, February, 2001.The amount of work we perform when computing a marginal is bounded bythe kích cỡ of the largest term that we encounter. Choosing a summation(elimination) ordering tominimize this is NP-hard, although greedy algorithms work well inpractice.

Xem thêm: Card Intel Hd Graphics Chơi Được Game Gì ? Đồ Họa Ra Sao?Acup

## Dynamic programming

If we wish lớn compute several marginals at the same time, we can use DynamicProgramming (DP) to lớn avoid the redundant computation that would be involvedif we used variable elimination repeatedly.If the underlying undirected graph of the BN is acyclic (i.e., a tree), we can use alocal message passing algorithm due to Pearl.This is a generalization of the well-known forwards-backwardsalgorithm for HMMs (chains).For details, see "Probabilistic Reasoning in Intelligent Systems", Judea Pearl,1988, 2nd ed. "Fusion and propogation with multiple observations in belief networks", Peot và Shachter, AI 48 (1991) p. 299-318.If the BN has undirected cycles (as in the water sprinkler example), local message passing algorithms run the risk of double counting.e.g., the information from S & R flowinginto W is not independent, because it came from a comtháng cause, C.The most comtháng approach is therefore khổng lồ convert the BN into lớn a tree,by clustering nodes together, to lớn form what is called ajunction tree, & then running a local message passing algorithm onthis tree. The message passing scheme could be Pearl"s algorithm, butit is more common to use a variant designed for undirected models.For more details, cliông chồng hereThe running time of the DP. algorithms is exponential in the kích thước ofthe largest cluster (these clusters correspond lớn the intermediateterms created by variable elimination). This form size is called theinduced width of the graph. Minimizing this is NP-hard.

## Approximation algorithms

Many models of interest,such as those with repetitive sầu structure, as inmultivariate time-series or image analysis,have sầu large induced width, which makes exactinference very slow.We must therefore resort to lớn approximation techniques.Unfortunately, approximate inference is #P-hard, but we can nonetheless come upwith approximations which often work well in practice. Below is a listof the major techniques. Variational methods.The simplest example is the mean-field approximation,which exploits the law oflarge numbers khổng lồ approximate large sums of random variables by theirmeans. In particular, we essentially decouple all the nodes, andintroduce a new parameter, called a variational parameter, for eachnode, & iteratively update these parameters so as to minimize thecross-entropy (KL distance) between the approximate & trueprobability distributions. Updating the variational parameters becomes a proxy forinference. The mean-field approximation produces a lower bound on thelikelihood. More sophisticated methods are possible, which givetighter lower (and upper) bounds. Sampling (Monte Carlo) methods. The simplest kind is importancesampling, where we draw random samples x from P(X), the (unconditional)distribution on the hidden variables, andthen weight the samples by their likelihood, P(y|x), where y is theevidence. A more efficient approach in high dimensions is called MonteCarlo Markov Chain (MCMC), & includes as special cases Gibbs sampling and the Metropolis-Hasting algorithm. "Loopy belief propogation". This entailsapplying Pearl"s algorithm to the originalgraph, even if it has loops (undirected cycles).In theory, this runs the risk of double counting, but Yair Weiss andothers have proved that in certain cases (e.g., a single loop), events are double counted"equally", and hence "cancel" lớn give the right answer.Belief propagation is equivalent lớn exact inference on a modifiedgraph, called the universal cover or unwrapped/ computation tree,which has the same local topology as the original graph.This is the same as the Bethe & cavity/TAP.. approaches in statisticalphysionaga.vn.Hence there is a deep connection betweenbelief propagation and variational methods that people are currently investigating. Bounded cutphối conditioning. By instantiating subsets of the variables,we can break loops in the graph.Unfortunately, when the cutphối is large, this is very slow.By instantiating only a subphối of values of the cutphối, we can computelower bounds on the probabilities of interest.Alternatively, we can sample the cutsets jointly, a technique known as block Gibbs sampling. Parametric approximation methods.These express the intermediate summands in a simplerkhung, e.g., by approximating them as a hàng hóa of smaller factors."Minibuckets" và the Boyen-Koller algorithm fall inlớn this category.Approximate inference is a huge topic:see the references for more details.

## Inference in DBNs

The general inference problem for DBNs is khổng lồ computeP(X(i,t0) | y(:, t1:t2)), where X(i,t) represents the i"th hiddenvariable at time và t Y(:,t1:t2) represents all the evidencebetween times t1 và t2. (In fact, we often also want to compute joint distributions ofvariables over one or more time slcies.)There are several special cases of interest, illustrated below.The arrow indicates t0: it is X(t0) that we are trying lớn estimate.The shaded region denotes t1:t2, the available data.
Here is a simple example of inference in an LDS.Consider a particle moving in the plane atconstant velocity subject lớn random perturbations in its trajectory.The new position (x1, x2) is the old position plus the velocity (dx1,dx2) plus noise w.< x1(t) > = <1 0 1 0> < x1(t-1) > + < wx1 >< x2(t) > <0 1 0 1> < x2(t-1) > < wx2 >< dx1(t) > <0 0 1 0> < dx1(t-1) > < wdx1 >< dx2(t) > <0 0 0 1> < dx2(t-1) > < wdx2 >We assume we only observe the position of the particle.< y1(t) > = <1 0 0 0> < x1(t) > + < vx1 >< y2(t) > <0 1 0 0> < x2(t) > < vx2 > < dx1(t) > < dx2(t) >Suppose we start out at position (10,10) moving khổng lồ the right withvelođô thị (1,0).We sampled a random trajectory of length 15.Below we show the filtered & smoothed trajectories.
The mean squared error of the filtered estimate is 4.9; for thesmoothed estimate it is 3.2.Not only is the smoothed estimate better, but we know thatit is better, as illustrated by the smaller uncertainty ellipses;this can help in e.g., data association problems.Note how the smoothed ellipses are larger at the ends, because thesepoints have sầu seen less data. Also, note how rapidly the filteredellipses reach their steady-state (Ricatti) values.(See my Kalmanfilter toolbox for more details.)LEARNINGOne needs khổng lồ specify two things khổng lồ describe a BN: the graph topology(structure) & the parameters of each CPD. It is possible khổng lồ learnboth of these from data. However, learning structure is much harderthan learning parameters. Also, learning when some of the nodes arehidden, or we have sầu missing data, is much harder than when everythingis observed. This gives rise to 4 cases:Structure Observability Method---------------------------------------------Known Full Maximum Likelihood EstimationKnown Partial EM (or gradient ascent)Unknown Full Search through Model space Unknown Partial EM + tìm kiếm through mã sản phẩm space

## Known structure, full observability

We assume that the goal of learning in this case is khổng lồ find the valuesof the parameters of each CPD which maximizes the likelihood of thetraining data,which contains N cases (assumed lớn be independent).The normalized log-likelihood of the training set Dis a sum of terms, onefor each node:
We see that the log-likelihood scoring function decomposesaccording to the structure of the graph, và hence we can maximize thecontribution to the log-likelihood of each node independently(assuming the parameters in each node are independent of the other nodes).Consider estimating the Conditional Probability Table for the Wnode. If we have sầu a mix of training data, we can just count the numberof times the grass is wet when it is raining & the sprinler is on,N(W=1,S=1,R=1), the number of times the grass is wet when it israining và the sprinkler is off, N(W=1,S=0,R=1), etc. Given thesecounts (which are the sufficient statistionaga.vn), we can find the MaximumLikelihood Estimate of the CPT as follows:where the denominator is N(S=s,R=r) = N(W=0,S=s,R=r) + N(W=1,S=s,R=r).Thus "learning" just amounts to lớn counting (in the case of multinomialdistributions).For Gaussian nodes, we can compute the sample mean và variance, anduse linear regression to lớn estimate the weight matrix.For other kinds of distributions, more complex procedures arenecessary.As is well known from the HMM literature, ML estimates of CPTs areprone lớn sparse data problems, which can be solved by using (mixturesof) Dirichlet priors (pseuvì chưng counts).This results in a Maximum A Posteriori (MAP) estimate.For Gaussians, we can use a Wishart prior, etc.

## Known structure, partial observability

When some of the nodes are hidden, we can use the EM (ExpectationMaximization) algorithm to find a (locally) optimal Maximum LikelihoodEstimate of the parameters.The basic idea behind EM is that, if we knew the values ofall the nodes, learning (the M step) would be easy, as we sawabove. So in the E step, we compute the expected values of all thenodes using an inference algorithm, và then treat these expectedvalues as though they were observed (distributions). For example, in the caseof the W node, we replace the observed counts of the events with the number of timeswe expect lớn see each event:P(W=w|S=s,R=r) = E N(W=w,S=s,R=r) / E N(S=s,R=r)whereE N(x) is the expected number of times event xoccurs in the whole training set, given the current guess of the parameters.These expected counts can be computed as followsE N(.) = E sum_k I(. | D(k)) = sum_k P(. | D(k))where I(x | D(k)) is an indicator function which is 1 if eventx occurs in training case k, và 0 otherwise.Given the expected counts, we maximize the parameters, and thenrecompute the expected counts, etc. This iterative sầu procedure isguaranteed khổng lồ converge to lớn a local maximum of the likelihood surface.It is also possible lớn vày gradient ascent on the likelihood surface (thegradient expression also involves the expected counts), but EM isusually faster (since it uses the natural gradient) and simpler (sinceit has no step kích thước parameter andtakes care of parameter constraints (e.g., the "rows" of theCPT having lớn sum khổng lồ one) automatically).In any case, we see than when nodes are hidden, inference becomes asubroutine which is called by the learning procedure; hence fastinference algorithms are crucial.

## Unknown structure, full observability

We start by discussing the scoring function which we use lớn selectmodels; we then discuss algorithms which attempt to optimize thisfunction over the space of models, & finally examine their computational andsample complexity.

### The objective function used for Mã Sản Phẩm selection

The maximum likelihood Model will be acomplete graph, since this has the largest number of parameters, và hence can fit the data the best.A well-principled way to lớn avoid this kind of over-fitting is to put a prior on models,specifying that we prefer sparse models.Then, by Bayes" rule, the MAPhường. mã sản phẩm is the one that maximizes
Taking logs, we find
where c = - log Pr(D) is a constant independent of G.The effect of the structure prior P(G) is equivalent khổng lồ penalizing overly complexmodels.However, this is not strictly necessary, since the marginal likelihoodtermP(D|G) = int_ heta P(D|G, heta)has a similar effect of penalizing models with too many parameters(this is known as Occam"s razor).

### Search algorithms for finding the best model

The goal of structure learning is lớn learn a dag (directed acyclicgraph) that best explains the data. This is an NP-hard problem, sincethe number of dag"s on N variables is super-exponential in N. (There is no closedkhung formula for this, but lớn give you an idea,there are 543 dags on 4 nodes, & O(10^18) dags on 10 nodes.)If we know the ordering of the nodes, life becomes much simpler,since we can learn the parent phối foreach node independently(since the score is decomposable), and we don"t need lớn worry about acyclicity constraints.For each node, there at mostsum_k=0^n choicenk = 2^nsets of possible parents for each node, which can bearranged in a lattice as shown below for n=4.The problem is to find the highest scoring point in this lattice.
There are three obvious ways lớn search this graph: bottom up, topdown, or middle out.In the bottom up approach, we start at the bottom of thelattice, và evaluate the score at all points in each successivemàn chơi.We must decide whether the gains in score produced by alarger parent set is ``worth it"".The standard approach in the reconstructibility analysis (RA) community uses the factthat chi^2(X,Y) approx I(X,Y) N ln(4), where N is the number ofsamples và I(X,Y) is the mutual information (MI) between X and Y.Hence we can use a chi^2 chạy thử khổng lồ decidewhether an increase in the MI score is statistically significant.(This also gives us some kind of confidence measure on the connectionsthat we learn.)Alternatively, we can use a BIC score.Of course, if we bởi vì not know if we have achieved the maximum possiblescore, we vì chưng not know when lớn stop searching, and hence we mustevaluate all points in the lattice (although we can obviously usebranch-and-bound). For large n, this iscomputationally infeasible, so a common approach is to lớn only search upuntil cấp độ K (i.e., assume a bound on the maximum number of parentsof each node), which takes O(n ^ K) time.The obvious way khổng lồ avoid the exponential cost (& the need for abound, K) is to lớn use heuristionaga.vn toavoid examining all possible subsets.(In fact, we must use heuristionaga.vn of some kind, since the problem oflearning optimal structure is NP-hard citeChickering95.)One approach in the RA framework, called Extended Dependency Analysis (EDA)citeConant88, is as follows.Start by evaluating all subsets of form size up lớn two, keep all the ones withsignificant (in the chi^2 sense) XiaoMI with the target node, & take the union of the resulting setas the phối of parents.The disadvantage of this greedy technique is that it will fail to lớn find a setof parents unless some subphối of form size two has significant MI with thetarget variable. However, a Monte Carlosimulation in citeConant88 shows that most random relations havethis property.In addition, highly interdependent sets of parents (which mightfail the pairwise MI test) violate the causalindependence assumption, which is necessary to lớn justify the use ofnoisy-OR và similar CPDs.An alternative technique, popular in the UAI community, is khổng lồ startwith an initial guess of the mã sản phẩm structure (i.e., at a specificpoint in the lattice), và then persize localsearch, i.e., evaluate the score of neighboring points in the lattice,and move sầu to the best such point, until we reach a local optimum. Wecan use multiple restarts to lớn try to find the global optimum, and tolearn an ensemble of models.chú ý that, in the partially observable case, we need to lớn have aninitial guess of the mã sản phẩm structure in order to lớn estimate the valuesof the hidden nodes, and hence the (expected) score of each model; starting with the fullydisconnected mã sản phẩm (i.e., at the bottom of the lattice) would be a badidea, since it would lead to lớn a poor estimate.

## Unknown structure, partial observability

Finally, we come lớn the hardest case of all, where the structure isunknown và there are hidden variables and/or missing data.In this case, to compute the Bayesian score, we must marginalize outthe hidden nodes as well as the parameters.Since this is usually intractable, it is comtháng to lớn usean asymptoticapproximation khổng lồ the posterior called BIC (Bayesian InformationCriterion), which is defined as follows:log Pr(D|G) approx log Pr(D|G, hatTheta_G) - fraclog N2 #Gwhere N is the number of samples,hatTheta_G is the ML estimate of the parameters,and#G is the dimension of the model.(In the fully observable case, the dimension of a Mã Sản Phẩm is the numberof free parameters. In a model with hidden variables, it might be lessthan this.)The first term is just the likelihood andthe second term is a penalty for Mã Sản Phẩm complexity.(The BIC score is identical lớn the Minimum Description Length (MDL)score.)Although the BIC score decomposes inkhổng lồ a sum of local terms, one pernode,local search is still expensive, because we need khổng lồ run EM at eachstep to compute hatTheta. An alternative approach is khổng lồ bởi vì thelocal search steps inside of the M step of EM - this is calledStructureal EM, và provably converges to lớn a local maximum of the BIonaga.vncore (Friedman, 1997).

### Inventing new hidden nodes

So far, structure learning has meant finding the right connectivitybetween pre-existing nodes. A more interesting problem is inventinghidden nodes on demvà. Hidden nodes can make a Model much morecompact, as we see below.
(a) A BN with a hidden variable H. (b) The simplest networkthat can capture the same distribution without using a hidden variable(created using arc reversal và node elimination).If H is binary and the other nodes are trinary, and we assume fullCPTs, the first network has 45 independent parameters, và the secondhas 708.The standard approach is to keep addinghidden nodes one at a time, to some part of the network (see below),performing structure learning at eachstep, until the score drops.One problem is choosing the cardinality (number of possible values)for the hidden node, và its type of CPD.Another problem ischoosing where to lớn add the new hidden node.There is no point making it a child, since hidden children can alwaysbe marginalized away, so we need to find an existing node which needsa new parent, when the current mix of possible parents is not adequate.citeRamachandran98 use the following heuristic for finding nodeswhich need new parents: they considera noisy-OR node which is nearly always on, even if its non-leakparents are off, as anindicator that there is a missing parent.Generalizing this technique beyond noisy-ORs is an interesting openproblem.One approach might be lớn examineH(X|Pa(X)):if this is very high, it means the current phối of parents areinadequate to ``explain"" the residual entropy; if Pa(X) is thebest (in the BIC or chi^2 sense) mix of parents we have sầu been ableto find in the current model, itsuggests we need to create a new node and add it khổng lồ Pa(X).A simple heuristic for inventing hidden nodes in the case of DBNs iskhổng lồ kiểm tra if the Markov property is being violated for any particularnode. If so, it suggests that we need connections lớn slices furtherbaông xã in time. Equivalently, we can add new lag variablesand connect khổng lồ them.Of course,interpreting the ``meaning"" of hidden nodes is alwaystricky, especially since they are often unidentifiable,e.g., we can often switch the interpretation of the true & false states(assuming for simplicity that the hidden node is binary) provided wealso permute the parameters appropriately. (Symmetries such as this areone cause of the multiple maxima in the likelihood surface.)

## Further reading on learning

The following are good tutorial articles. W. L. Buntine, 1994."Operations for Learning with Graphical Models",J. AI Retìm kiếm, 159--225. D. Heckerman, 1996."A tutorial on learning with Bayesian networks",Microsoft Retìm kiếm tech. report, MSR-TR-95-06.Decision TheoryIt is often said that "Decision Theory = Probability Theory + UtilityTheory".We have outlined above sầu how we can Mã Sản Phẩm joint probability distributions in acompact way by using sparse graphs khổng lồ reflect conditional independencerelationships.It is also possible lớn decompose multi-attribute utility functions imãng cầu similar way:we create a node for each term in the sum, whichhas as parents all the attributes (randomvariables) on which it depends; typically, the utility node(s) willhave action node(s) as parents, since the utility depends both on thestate of the world & the action we perkhung.The resulting graph is called an influence diagram.In principle, we can then use the influence diagram lớn computethe optimal (sequence of) action(s) khổng lồ perform so as to maximimizeexpected utility, although this is computationally intractible for allbut the smallest problems.Classical control theory is mostly concerned with the special casewhere the graphical model is aLinear Dynamical Systemand the utility function is negative sầu quadratic loss, e.g., consider amissile tracking an airplane: its goal is khổng lồ minimize the squareddistance between itself & the target. When the utility functionand/or the system Mã Sản Phẩm becomes more complicated, traditional methodsbreak down, & one has lớn use reinforcement learning to lớn find theoptimal policy (mapping from states to lớn actions).ApplicationsThe most widely used Bayes Nets are undoubtedly the ones embedded inMicrosoft"s products, including the AnswerWizard of Office 95, the Office Assistant (the bouncy paperclip guy) ofOffice 97, & over 30 Technical Support Troubleshooters.BNs originally arose out of an attempt to lớn add probabilities toexpert systems, & this is still the most comtháng use for BNs.A famous example isQMR-DT, a decision-theoretic reformulation of the Quick MedicalReference (QMR) mã sản phẩm.Here, the top layer represents hidden disease nodes, and the bottomlayer represents observed symptom nodes.The goal is to lớn infer the posterior probability of each disease givenall the symptoms (which can be present, absent or unknown).QMR-DT is so densely connected that exact inference is impossible. Various approximationmethods have been used, including sampling, variational & loopybelief propagation.Another interesting fielded application is theVista system, developedby Eric Horvitz.The Vista system is a decision-theoretic system that has been used atNASA Mission Control Center in Houston for several years. The systemuses Bayesian networks to lớn interpret live telemetry & provides adviceon the likelihood of alternative failures of the space shuttle"spropulsion systems. It also considers time criticality & recommendsactions of the highest expected utility. The Vista system also employsdecision-theoretic methods for controlling the display of informationto dynamically identify the most important information khổng lồ highlight.Horvitz has gone on to attempt to lớn apply similar technology toMicrosoft products, e.g., the Lumiere project.Special cases of BNs were independently invented by many differentcommunities, for use in e.g., genetionaga.vn (linkage analysis), speechrecognition (HMMs), tracking (Kalman fitering), data compression(density estimation)và coding (turbocodes), etc.For examples of other applications, see thespecial issue of Proc. ACM 38(3), 1995,and the MicrosoftDecision Theory Group page.

## Applications to biology

This is one of the hotchạy thử areas.For a review, seeInferring cellular networks using probabilistic graphical modelsScience, Nir Friedman, v303 p799, 6 Feb 2004.Recommended introductory reading

## Books

In reverse chronological order. Daphne Koller và Nir Friedman,"Probabilistic graphical models: principles & techniques",MIT Press 2009 Adnan Darwiche,"Modeling và reasoning with Bayesian networks",Cambridge 2009F. V. Jensen."Bayesian Networks and Decision Graphs".Springer.2001.Probably the best introductory book available.D. Edwards."Introduction khổng lồ Graphical Modelling", 2nd ed.Springer-Verlag.2000.Good treatment of undirected graphical models from a statistical perspective sầu. J. Pearl."Causality".Cambridge.2000.The definitive book on using causal DAG modeling. R. G. Cowell, A. Phường. Dawid, S. L. Lauritzen andD. J. Spiegelhalter."Probabilistic Networks & Expert Systems".Springer-Verlag.1999.Probably the best book available,although the treatment is restricted khổng lồ exact inference. M. I. Jordan (ed)."Learning in Graphical Models".MIT Press.1998.Loose collection of papers on machine learning, many related tographical models.One of the few books lớn discuss approximate inference. B. Frey."Graphical models for machine learning và digital communication",MIT Press.1998.Discusses pattern recognition & turbocodes using (directed)graphical models. E. Castillo & J. M. Gutierrez & A. S. Hadi."Expert systems và probabilistic network models".Springer-Verlag, 1997.A Spanishversion is available online for không tính tiền. F. Jensen."An introduction khổng lồ Bayesian Networks".UCL Press.1996.Out of print.Superceded by his 2001 book. S. Lauritzen."Graphical Models",Oxford.1996.The definitive sầu mathematical exposition of the theory of graphicalmodels. S. Russell & P. Norvig."Artificial Intelligence: A Modern Approach".Prentice Hall.1995. Popular undergraduate textbook that includes a readable chapter ondirected graphical models. J. Whittaker."Graphical Models in Applied Multivariate Statistionaga.vn",Wiley.1990.This is the first book published on graphical modelling from a statistionaga.vnperspective sầu. R. Neapoliton."Probabilistic Reasoning in Expert Systems".John Wiley và Sons.1990. J. Pearl."Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference."Morgan Kaufmann.1988. The book that got it all started!A very insightful book, still relevant today.

## review articles

Phường. Smyth, 1998."Belief networks, hidden Markov models, và Markovrandom fields: a unifying view",Pattern Recognition Letters. E. Charniak, 1991."Bayesian Networks without Tears", AI magazine.Sam Roweis & Zoubin Ghahramani, 1999.A Unifying đánh giá of Linear Gaussian Models,Neural Computation 11(2) (1999) pp.305-345

## Exact Inference

C. Huang and A. Darwibịt, 1996."Inference in Belief Networks: A procedural guide",Intl. J. Approximate Reasoning, 15(3):225-263. R. McEliece and S. M. Aji, 2000.The Generalized Distributive sầu Law,IEEE Trans. Insize. Theory, vol. 46, no. 2 (March 2000),pp. 325--343. F. Kschischang, B. Frey and H. Loeliger, 2001.Factorgraphs & the sum hàng hóa algorithm,IEEE Transactions on Information Theory, February, 2001. M. Peot and R. Shachter, 1991."Fusion and propogation with multiple observations in belief networks",Artificial Intelligence, 48:299-318.

## Approximate Inference

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, và L. K. Saul, 1997."An introduction khổng lồ variational methods for graphical models." D. MacKay, 1998."An introduction lớn Monte Carlo methods". T. Jaakkola và M. Jordan, 1998."Variational probabilistic inference & the QMR-DT database"

## Learning

W. L. Buntine, 1994."Operations for Learning with Graphical Models",J. AI Retìm kiếm, 159--225. D. Heckerman, 1996."A tutorial on learning with Bayesian networks",Microsoft Retìm kiếm tech. report, MSR-TR-95-06.

Xem thêm: Đau Các Khớp Ngón Tay Là Bệnh Gì ? Xem Ngay! Các Nguyên Nhân Gây Ra Đau Khớp Ngón Tay

## DBNs

L. R. Rabiner, 1989."A Tutorial in Hidden Markov Models and Selected Applications in Speech Recognition",Proc. of the IEEE, 77(2):257--286. Z. Ghahramani, 1998. Learning Dynamic Bayesian Networks In C.L. Giles and M. Gori (eds.), Adaptive Processing of Sequences & Data Structures . Lecture Notes in Artificial Intelligence, 168-197. Berlin: Springer-Verlag.