 07 Jan 2020 | 19 views | 0 downloads | 15 Pages | 299.21 KB

Share Pdf : Chapter 2 Estimating Probabilities

Report CopyRight/DMCA Form For : Chapter 2 Estimating Probabilities

## Transcription

c 2016 Tom M Mitchell 2,Gender HoursWorked Wealth probability. female 40 5 poor 0 2531,female 40 5 rich 0 0246,female 40 5 poor 0 0422. female 40 5 rich 0 0116,male 40 5 poor 0 3313,male 40 5 rich 0 0972. male 40 5 poor 0 1341,male 40 5 rich 0 1059, Table 1 A Joint Probability Distribution This table defines a joint probability distri. bution over three random variables Gender HoursWorked and Wealth. Gender the number of HoursWorked each week and their Wealth In general. defining a joint probability distribution over a set of discrete valued variables in. volves three simple steps, 1 Define the random variables and the set of values each variable can take.
on For example in Table 1 the variable Gender can take on the value. male or female the variable HoursWorked can take on the value 40 5. or 40 5 and Wealth can take on values rich or poor. 2 Create a table containing one row for each possible joint assignment of val. ues to the variables For example Table 1 has 8 rows corresponding to the 8. possible ways of jointly assigning values to three boolean valued variables. More generally if we have n boolean valued variables there will be 2n rows. in the table, 3 Define a probability for each possible joint assignment of values to the vari. ables Because the rows cover every possible joint assignment of values. their probabilities must sum to 1, The joint probability distribution is central to probabilistic inference because. once we know the joint distribution we can answer every possible probabilistic. question that can be asked about these variables We can calculate conditional or. joint probabilities over any subset of the variables given their joint distribution. This is accomplished by operating on the probabilities for the relevant rows in the. table For example we can calculate, The probability that any single variable will take on any specific value For. example we can calculate that the probability P Gender male 0 6685. for the joint distribution in Table 1 by summing the four rows for which. Gender male Similarly we can calculate the probability P Wealth. rich 0 2393 by adding together the probabilities for the four rows cover. ing the cases for which Wealth rich,c 2016 Tom M Mitchell 3. The probability that any subset of the variables will take on a particular joint. assignment For example we can calculate that the probability P Wealth rich. Gender female 0 0362 by summing the two table rows that satisfy this. joint assignment, Any conditional probability defined over subsets of the variables Recall.
the definition of conditional probability P Y X P X Y P X We can. calculate both the numerator and denominator in this definition by sum. ming appropriate rows to obtain the conditional probability For example. according to Table 1 P Wealth rich Gender female 0 0362 0 3315. To summarize if we know the joint probability distribution over an arbi. trary set of random variables X1 Xn then we can calculate the conditional. and joint probability distributions for arbitrary subsets of these variables e g. P Xn X1 Xn 1 In theory we can in this way solve any classification re. gression or other function approximation problem defined over these variables. and furthermore produce probabilistic rather than deterministic predictions for. any given input to the target function 1 For example if we wish to learn to. predict which people are rich or poor based on their gender and hours worked. we can use the above approach to simply calculate the probability distribution. P Wealth Gender HoursWorked,1 1 Learning the Joint Distribution. How can we learn joint distributions from observed training data In the example. of Table 1 it will be easy if we begin with a large database containing say descrip. tions of a million people in terms of their values for our three variables Given a. large data set such as this one can easily estimate a probability for each row in the. table by calculating the fraction of database entries people that satisfy the joint. assignment specified for that row If thousands of database entries fall into each. row we will obtain highly reliable probability estimates using this strategy. In other cases however it can be difficult to learn the joint distribution due to. the very large amount of training data required To see the point consider how our. learning problem would change if we were to add additional variables to describe. a total of 100 boolean features for each person in Table 1 e g we could add do. they have a college degree are they healthy Given 100 boolean features. the number of rows in the table would now expand to 2100 which is greater than. 1030 Unfortunately even if our database describes every single person on earth. we would not have enough data to obtain reliable probability estimates for most. rows There are only approximately 1010 people on earth which means that for. most of the 1030 rows in our table we would have zero training examples This. 1 Of course if our random variables have continuous values instead of discrete we would need. an infinitely large table In such cases we represent the joint distribution by a function instead of a. table but the principles for using the joint distribution remain unchanged. c 2016 Tom M Mitchell 4, is a significant problem given that real world machine learning applications often. use many more than 100 features to describe each example for example many. learning algorithms for text analysis use millions of features to describe text in a. given document, To successfully address the issue of learning probabilities from available train. ing data we must 1 be smart about how we estimate probability parameters from. available data and 2 be smart about how we represent joint probability distribu. 2 Estimating Probabilities, Let us begin our discussion of how to estimate probabilities with a simple exam. ple and explore two intuitive algorithms It will turn out that these two intuitive. algorithms illustrate the two primary approaches used in nearly all probabilistic. machine learning algorithms, In this simple example you have a coin represented by the random variable.
c 2016 Tom M Mitchell 7, the learning task is to estimate the unknown value of P X 1 for a boolean. valued random variable X based on a sample of n values of X drawn indepen. dently e g n independent flips of a coin with probability of heads In this. figure the true value of is 0 3 and the same sequence of training examples is. used in each plot Consider first the plot in the upper left The blue line shows. the estimates of produced by Algorithm 1 MLE as the number n of training. examples grows The red line shows the estimates produced by Algorithm 2 us. ing the same training examples and using priors 0 42 and 1 18 This prior. assumption aligns with the correct value of i e 1 1 0 0 3 Note. that as the number of training example coin flips grows both algorithms converge. toward the correct estimate of though Algorithm 2 provides much better esti. mates than Algorithm 1 when little data is available The bottom left plot shows. the estimates if Algorithm 2 uses even more confident priors captured by twice as. many imaginary examples 0 84 and 1 36 The two plots on the right side. of the figure show the estimates produced when Algorithm 2 MAP uses incor. rect priors where 1 1 0 0 4 The difference between the top right and. bottom right plots is again only a difference in the number of imaginary examples. reflecting the difference in confidence that should be close to 0 4. 2 1 Maximum Likelihood Estimation MLE, Maximum Likelihood Estimation often abbreviated MLE estimates one or more. probability parameters based on the principle that if we observe training data D. we should choose the value of that makes D most probable When applied to. the coin flipping problem discussed above it yields Algorithm 1 The definition. of the MLE in general is,MLE arg max P D 1, The intuition underlying this principle is simple we are more likely to observe. data D if we are in a world where the appearance of this data is highly probable. Therefore we should estimate by assigning it whatever value maximizes the. probability of having observed D, Beginning with this principle for choosing among possible estimates of it. is possible to mathematically derive a formula for the value of that provably. maximizes P D Many machine learning algorithms are defined so that they. provably learn a collection of parameter values that follow this maximum likeli. hood principle Below we derive Algorithm 1 for our above coin flip example. beginning with the maximum likelihood principle, To precisely define our coin flipping example let X be a random variable.
which can take on either value 1 or 0 and let P X 1 refer to the true but. possibly unknown probability that a random draw of X will take on the value 1 2. Assume we flip the coin X a number of times to produce training data D in which. 2A random variable defined in this way is called a Bernoulli random variable and the proba. bility distribution it follows defined by is called a Bernoulli distribution. c 2016 Tom M Mitchell 8, we observe X 1 a total of 1 times and X 0 a total of 0 times We further. assume that the outcomes of the flips are independent i e the result of one coin. flip has no influence on other coin flips and identically distributed i e the same. value of governs each coin flip Taken together these assumptions are that the. coin flips are independent identically distributed which is often abbreviated to. The maximum likelihood principle involves choosing to maximize P D. Therefore we must begin by writing an expression for P D or equivalently. P 1 0 in terms of then find an algorithm that chooses a value for that. maximizes this quantity To begin note that if data D consists of just one coin flip. then P D if that one coin flip results in X 1 and P D 1 if the. result is instead X 0 Furthermore if we observe a set of i i d coin flips such. as D 1 1 0 1 0 then we can easily calculate P D by multiplying together. the probabilities of each individual coin flip,P D 1 1 0 1 0 1 1 3 1 2. In other words if we summarize D by the total number of observed times 1 when. X 1 and the number of times 0 that X 0 we have in general. P D h 1 0 i 1 1 0 2, The above expression gives us a formula for P D h 1 0 i The quantity. P D is often called the data likelihood or the data likelihood function because. it expresses the probability of the observed data D as a function of This likeli. hood function is often written L P D, Our final step in this derivation is to determine the value of that maximizes. the data likelihood function P D h 1 0 i Notice that maximizing P D. with respect to is equivalent to maximizing its logarithm ln P D with respect. to because ln x increases monotonically with x,arg max P D arg max ln P D.
It often simplifies the mathematics to maximize ln P D rather than P D as. is the case in our current example In fact this log likelihood is so common that it. has its own notation ln P D, To find the value of that maximizes ln P D and therefore also maximizes. P D we can calculate the derivative of ln P D h 1 0 i with respect to. then solve for the value of that makes this derivative equal to zero Because. ln P D is a concave function of the value of where this derivative is zero. will be the value that maximizes ln P D First we calculate the derivative of. the log of the likelihood function of Eq 2,c 2016 Tom M Mitchell 9. 1 ln 0 ln 1, where the last step follows from the equality x x and where the next to last. f x f x g x,step follows from the chain rule x g x x. Finally to calculate the value of that maximizes we set the derivative. in equation 3 to zero and solve for, Thus we have derived in equation 4 the intuitive Algorithm 1 for estimating.
starting from the principle that we want to choose the value of that maximizes. MLE arg max P D arg max ln P D 5, This same maximum likelihood principle is used as the basis for deriving many. machine learning algorithms for more complex problems where the solution is not. so intuitively obvious, 2 2 Maximum a Posteriori Probability Estimation MAP. Maximum a Posteriori Estimation often abbreviated to MAP estimation esti. mates one or more probability parameters based on the principle that we should. choose the value of that is most probable given the observed data D and our. prior assumptions summarized by P,MAP arg max P D, When applied to the coin flipping problem discussed above it yields Algorithm. 2 Using Bayes rule we can rewrite the MAP principle as. MAP arg max P D arg max,c 2016 Tom M Mitchell 10, and given that P D does not depend on we can simplify this by ignoring the. denominator,MAP arg max P D arg max P D P 6, Comparing this to the MLE principle described in equation 1 we see that whereas.
the MLE principle is to choose to maximize P D the MAP principle instead. maximizes P D P The only difference is the extra P. To produce a MAP estimate for we must specify a prior distribution P. that summarizes our a priori assumptions about the value of In the case where. data is generated by multiple i i d draws of a Bernoulli random variable as in our. coin flip example the most common form of prior is a Beta distribution. P Beta 0 1 7, Here 0 and 1 are parameters whose values we must specify in advance to define. a specific P As we shall see choosing values for 0 and 1 corresponds to. choosing the number of imaginary examples 0 and 1 in the above Algorithm. 2 The denominator B 0 1 is a normalization term defined by the function B. which assures the probability integrates to one but which is independent of. As defined in Eq 6 the MAP estimate involves choosing the value of that. maximizes P D P Recall we already have an expression for P D in Eq. 2 Combining this with the above expression for P we have. MAP arg max P D P,1 0 1 1 1 0 1,1 1 1 1 0 0 1,arg max 1 1 1 1 0 0 1 8. where the final line follows from the previous line because B 0 1 is indepen. How can we solve for the value of that maximizes the expression in Eq 8. Fortunately we have already answered this question Notice that the quantity we. seek to maximize in Eq 8 can be made identical to the likelihood function in Eq. 2 if we substitute 1 1 1 for 1 in Eq 2 and substitute 0 0 1. for 0 We can therefore reuse the derivation of MLE beginning from Eq 2 and. ending with Eq 4 simply by carrying through this substitution Applying this. same substitution to Eq 4 implies the solution to Eq 8 is therefore. MAP arg max P D P 9,1 1 1 0 0 1,c 2016 Tom M Mitchell 11. Figure 2 Prior left and posterior right probability distributions on in a MAP. estimate for Consider estimating P X 1 for a boolean valued Bernoulli. random variable X Suppose we set a prior P defined by a Beta distribution with. 0 3 1 4 as shown on the left Suppose further that we now observe data D con. sisting of 100 examples 50 where we observe X 1 and 50 where X 0 Then the. posterior probability P D over given this observed data D which is proportional to. P D P is another Beta distribution with 0 53 1 54 as shown on the right. Notice that both distributions assign non zero probability to every possible value of. between 0 and 1 though the posterior distribution has most of its probability mass near. Thus we have derived in Eq 9 the intuitive Algorithm 2 for estimating. starting from the principle that we want to choose the value of that maximizes. P D The number 1 of imaginary heads in Algorithm 2 is equal to 1 1. and the number 0 of imaginary tails is equal to 0 1 This same maximum. a posteriori probability principle is used as the basis for deriving many machine. learning algorithms for more complex problems where the solution is not so intu. itively obvious as it is in our coin flipping example. 2 2 1 MAP Priors and Posteriors, Why did we chose above to use the Beta 0 1 family of probability distributions. to define our prior probability P when calculating the MAP estimate MAP. We made this choice because the Beta distribution has a functional form that is. the same as the data likelihood P D in our problem so that when we multiply. these two forms together to get the quantity P D P this product is easily. expressed as yet another expression of this same functional form Furthermore in. this product the 0 and 1 parameters that define our Beta distribution play exactly. the same role as the observed data counts that is they capture the effect of the. Beta prior P in a form that is equivalent to specifying the number of imaginary. examples used in our earlier Algorithm 23, Figure 2 shows an example of the prior P and posterior P D P D P.
distributions corresponding to a MAP estimate in our example. 3 More precisely the number of imaginary examples i in Algorithm 2 is given by i 1. c 2016 Tom M Mitchell 12, The Beta 0 1 distribution defined in Eq 7 is called the conjugate prior. for the binomial likelihood function 1 1 0 because the posterior distribu. tion P D P is also a Beta distribution More generally any P is called the. conjugate prior for a likelihood function L P D if the posterior P D is. of the same form as P,3 Working with Other Probability Distributions. Formally the probability distribution we considered above is called a Bernoulli. distribution it governs a random variable X which can take on two possible val. ues 0 or 1 and this Bernoulli distribution is defined by the single parameter. i e P X 1 and P X 0 1 We will sometimes refer to a. boolean valued random variable which is governed by a Bernoulli distribution as. a Bernoulli random variable As noted above the conjugate prior for estimating. the parameter of a Bernoulli distribution is the Beta distribution. 3 1 Discrete Valued Random Variables with Many Values. If we have a random variable that can take on more than two values then we need. more than one parameter to describe the probability distribution for that variable. For example consider a six sided die which when rolled can come up with any. of 6 possible results, A common approach to characterizing such n valued random variables is to. use a generalization of the Bernoulli distribution called a Categorical distribu. tion where we assign a different probability to each possible value of the ran. dom variable For example we might model a six sided die as a random vari. able X that can take on the values 1 through 6 and represent its probability. distribution by a vector of six different parameters h 1 6 i where the. parameter i describes the probability that X will take on its ith value that is. i P X i The likelihood function L P D for estimating the vector. of parameters from observed rolls of the die is a simple generalization of the. likelihood for estimating a Bernoulli distribution It takes the form of a product. L P D 1 1 2 2 n n where i indicates the observed count of times. when the value X i was observed in the data Given a sample of data drawn. from a particular Categorical distribution for a random variable that can take on n. possible values the maximum likelihood estimate for i is given by. where j indicates the number of times the value X j was observed in the data. Note that just like the case of a binary valued random variable see eq 5 the. MLE is simply the fraction of times the particular value was observed in the data. If we prefer a MAP estimate for an n valued random variable governed by a. Categorical distribution we use the conjugate prior for the Categorical distribu. tion which is called the Dirichlet distribution Of course given that the Categorical. c 2016 Tom M Mitchell 13, distribution has n different i parameters its prior will have to specify the proba. bility for each joint assignment of these n parameters The Dirichlet distribution. is a generalization of the Beta distribution and has the form. where the denominator is again a normalizing function to assure that the total. probability mass is 1 and where this normalizing function B 1 n is inde. pendent of the vector of parameters h 1 n i and therefore can be ignored. when deriving their MAP estimates, The MAP estimate for each i for a Categorial distribution is given by.
1 1 1 n n 1, where j indicates the number of times the value X j was observed in the. data and where the j s are the parameters of the Dirichlet prior which reflects. our prior knowledge or assumptions Here again we can view the MAP estimate. as combining the observed data given by the j values with j 1 additional. imaginary observations for X j Comparing this formula to the earlier formula. giving the MAP estimate for a Bernoulli random variable eq 9 it is easy to see. that this is a direct generalization of that simpler case and that it again follows the. intuition of our earlier Algorithm 2,4 What You Should Know. The main points of this chapter include, Joint probability distributions lie at the core of probabilistic machine learn. ing approaches Given the joint probability distribution P X1 Xn over a. set of random variables it is possible in principle to compute any joint or. conditional probability defined over any subset of these variables. Learning or estimating the joint probability distribution from training data. can be easy if the data set is large compared to the number of distinct prob. ability terms we must estimate But in many practical problems the data. is more sparse requiring methods that rely on prior knowledge or assump. tions in addition to observed data, Maximum likelihood estimation MLE is one of two widely used principles. for estimating the parameters that define a probability distribution This. principle is to choose the set of parameter values MLE that makes the ob. served training data most probable over all the possible choices of. MLE arg max P data,c 2016 Tom M Mitchell 14, In many cases maximum likelihood estimates correspond to the intuitive.
notion that we should base probability estimates on observed ratios For. example given the problem of estimating the probability that a coin will. turn up heads given 1 observed flips resulting in heads and 0 observed. flips resulting in tails the maximum likelihood estimate corresponds exactly. to taking the fraction of flips that turn up heads. MLE arg max P data, Maximium a posteriori probability MAP estimation is the other of the two. widely used principles This principle is to choose the most probable value. of given the observed training data plus a prior probability distribution. P which captures prior knowledge or assumptions about the value of. MAP arg max P data arg max P data P, In many cases MAP estimates correspond to the intuitive notion that we. can represent prior assumptions by making up imaginary data which re. flects these assumptions For example the MAP estimate for the above coin. flip example assuming a prior P Beta 0 1 1 1 yields a MAP. estimate which is equivalent to the MLE estimate if we simply add in an. imaginary 1 heads and 0 tails to the actual observed 1 heads and 0 tails. MAP arg max P data P, 1 In the MAP estimation of for our Bernoulli random variable X in this. chapter we used a Beta 0 1 prior probability distribution to capture our. prior beliefs about the prior probability of different values of before see. ing the observed data, Plot this prior probability distribution over corresponding to the. number of imaginary examples used in the top left plot of Figure 1. i e 0 42 1 18 Specifically create a plot showing the prior. probability vertical axis for each possible value of between 0 and 1. horizontal axis as represented by the prior distribution Beta 0 1. Recall the correspondence i i 1 Note you will want to write a. simple computer program to create this plot, Above you plotted the prior probability over possible values of.
Now plot the posterior probability distribution over given that prior. plus observed data in which 6 heads X 1 were observed along. with 9 tails X 0,c 2016 Tom M Mitchell 15, View the plot you created above to visually determine the approximate. Maximum a Posterior probability estimate MAP What is it What is. the exact value of the MAP estimate What is the exact value of the. Maximum Likelihood Estimate MLE, How do you think your plot of the posterior probability would change. if you altered the Beta prior distribution to use 0 420 1 180. hint it s ok to actually plot this What if you changed the Beta prior. to 0 32 1 28,5 Acknowledgements, I very much appreciate receiving helpful comments on earlier drafts of this chapter. from Ondr ej Filip Ayush Garg Akshay Mishra and Tao Chen Andrew Moore. provided the data summary shown in Table 1,REFERENCES. Mitchell T 1997 Machine Learning McGraw Hill, Wasserman L 2004 All of Statistics Springer Verlag.

## Related Books

###### INTERIOR WOOD DOORS - River City Millwork 5 Pine Framed Opening 4-9/16 KD Size Price Ball catch strike plate colors: <=3/0x6/8 \$25.00 >3/0x6/8 \$30.00 Add \$4.25 for set up

###### Metoder for rehabilitering av fuktskadde kjellerytervegger 1 Norsk bygningsfysikkdag 2012 Metoder for rehabilitering av fuktskadde kjellerytervegger Stig Geving, prof. Institutt for bygg, anlegg og transport

###### BARBECUES : 3 - 4 SERIES WOODY L - LD - LX CLASS 3 WLX ... 1 3 2 screw bag / sachet de vis 5010001641 jet / injecteur 5010002164 (28mb) 5010002448 (50mb) 5010002171 5010001640 5010002634 barbecues : 3 - 4 series woody l - ld - lx class 3 wlx 3 series woody l 3 series woody lx class 3 wlx 4 series woody l 4 series woody lx class 3 wlx fr (piezo) 3 series woody ld 31906 5010000873 74881 5010001593 5010001596 5010001668 5010001597 5010001670 5010001598 ...

###### Scott Foresman Science - Plain Local Schools Illustration Title Page, 20, 21, 22 Adam Benton Photographs: Every effort has been made to secure permission and provide appropriate credit for photographic material. The publisher deeply regrets any omission and pledges to correct errors called to its attention in subsequent editions.

###### CONCEALED TYPE LAVATORY CARRIER - Josam 46 Fixture Carriers The drawings on this page show the relationship between mating parts of carriers covered in this section. Parts & Details Lavatory, Sink & Urinal ...

###### 2016 ANNUAL REPORT AND 2017 PRIORITIES Technology ... 2016 ANNUAL REPORT AND 2017 PRIORITIES Technology Advancement Program 10 Years of Progress Moving Towards Zero Emissions March 2017

###### ABORIGINAL AND TORRES STRAIT ISLANDER GUIDELINES FOR DRAMA ... Need by Jack Davis ... ABORIGINAL AND TORRES STRAIT ISLANDER GUIDELINES FOR DRAMA/THEATRE EDUCATION ... Aboriginal and Torres Strait Islander theatre practices for ...

###### Encyclopedia of Micro?uidics and Nano?uidics Aerospace Research and Education Center (AeREC) and Aerospace Engineering Program Washington University St. Louis, MO USA [email protected] AI,LISONG Department of Biomedical Engineering & Cardiovascular Medicine School of Engineering & School of Medicine University of Southern California Los Angles, CA USA AKHMECHET,ROMAN Sibley School of ...

###### LM185/LM285/LM385 Adjustable Micropower Voltage References LM185/LM285/LM385 Adjustable Micropower Voltage References General Description The LM185/LM285/LM385 are micropower 3-terminal ad-justable band-gap voltage reference diodes. Operating from 1.24 to 5.3V and over a 10 mA to 20 mA current range, they feature exceptionally low dynamic impedance and good temperature stability. On-chip trimming is ...

###### LM185-2.5/LM285-2.5/LM385-2.5 Micropower Voltage Reference ... Absolute Maximum Ratings (Notes 1, 2) If Military/Aerospace specified devices are required, please contact the National Semiconductor Sales Office/ Distributors for availability and specifications.