Naive Bayes
Let’s focus on the data in Table 1. The subscript, $j$, is used to represent the feature/dimension. The superscript, $(i)$, is used to represent the observation (here the $i$-th observation). First, we make some basic modelling assumptions about the data.
Distributions
Each $y^{(i)}$ has a Categorical distribution:
$$y^{(i)} \sim \mathrm{Categorical}(\pi_1, \ldots, \pi_K) \tag{1}$$
The following is equivalent:
$$p\!\left(y^{(i)} = k \,\middle|\, \boldsymbol{\pi}\right) = \pi_k \tag{2}$$
where $k$ ranges over the classes, and $\pi_k$ is the probability of the random variable taking on the value $k$. This means that each $y^{(i)}$ comes from a distribution whose possible outcomes form a finite set of $K$ classes (in the table the classes are $0$, $1$ and $2$, so $K = 3$). Since this is a probability distribution, we have the constraint:
$$\sum_{k=1}^{K} \pi_k = 1 \tag{3}$$
Given a value for $y^{(i)}$, each $x_j^{(i)}$ has a Bernoulli distribution (it takes on values in $\{0, 1\}$):
$$x_j^{(i)} \mid y^{(i)} = k \;\sim\; \mathrm{Bernoulli}(\theta_{jk}) \tag{4}$$
The following is equivalent:
$$p\!\left(x_j^{(i)} \,\middle|\, y^{(i)} = k, \boldsymbol{\theta}\right) = \theta_{jk}^{x_j^{(i)}} \left(1 - \theta_{jk}\right)^{1 - x_j^{(i)}} \tag{5}$$
The distribution parameters used above are sometimes summarised into a single parameter:
$$\boldsymbol{\lambda} = \left\{ \boldsymbol{\pi}, \boldsymbol{\theta} \right\}, \qquad \boldsymbol{\pi} = (\pi_1, \ldots, \pi_K), \quad \boldsymbol{\theta} = \left\{ \theta_{jk} : j = 1, \ldots, d;\; k = 1, \ldots, K \right\} \tag{6}$$
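To make these assumptions concrete, here is a minimal sketch of the generative model in Equations (1)–(5), written in Python/NumPy with made-up parameter values (the names `pi`, `theta`, `x` and `y` are illustrative choices, not taken from Table 1):

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, N = 3, 4, 10                 # number of classes, features, observations
pi = np.array([0.5, 0.2, 0.3])     # Categorical parameters (Eq. 1), sum to 1 (Eq. 3)
theta = rng.uniform(size=(d, K))   # theta[j, k] = p(x_j = 1 | y = k) (Eq. 4)

y = rng.choice(K, size=N, p=pi)                              # y^(i) ~ Categorical(pi)
x = (rng.uniform(size=(N, d)) < theta[:, y].T).astype(int)   # x_j^(i) | y^(i) ~ Bernoulli(theta_jk)

print(y)  # class labels in {0, 1, 2}
print(x)  # N x d binary feature matrix
```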
Likelihood
The probability of seeing this data, $\mathcal{D} = \left\{ \left(\mathbf{x}^{(i)}, y^{(i)}\right) \right\}_{i=1}^{N}$, given the parameters can be represented as follows:
$$p(\mathcal{D} \mid \boldsymbol{\lambda}) = \prod_{i=1}^{N} p\!\left(\mathbf{x}^{(i)}, y^{(i)} \,\middle|\, \boldsymbol{\lambda}\right) = \prod_{i=1}^{N} p\!\left(\mathbf{x}^{(i)} \,\middle|\, y^{(i)}, \boldsymbol{\lambda}\right) p\!\left(y^{(i)} \,\middle|\, \boldsymbol{\lambda}\right) \tag{7}$$
Note that $p\!\left(\mathbf{x}^{(i)}, y^{(i)} \,\middle|\, \boldsymbol{\lambda}\right)$ is the same as the stricter notation $p\!\left(X = \mathbf{x}^{(i)}, Y = y^{(i)} \,\middle|\, \boldsymbol{\lambda}\right)$. The last equality above is due to the definition of conditional probability. $\mathbf{x}^{(i)}$ is only dependent on $y^{(i)}$ and $\boldsymbol{\theta}$, while $y^{(i)}$ is only dependent on $\boldsymbol{\pi}$:
$$p(\mathcal{D} \mid \boldsymbol{\lambda}) = \prod_{i=1}^{N} p\!\left(\mathbf{x}^{(i)} \,\middle|\, y^{(i)}, \boldsymbol{\theta}\right) p\!\left(y^{(i)} \,\middle|\, \boldsymbol{\pi}\right) \tag{8}$$
The “naive” assumption that the dimensions are independent allows us to write the above as follows:
$$p(\mathcal{D} \mid \boldsymbol{\lambda}) = \prod_{i=1}^{N} \left[ \prod_{j=1}^{d} p\!\left(x_j^{(i)} \,\middle|\, y^{(i)}, \boldsymbol{\theta}\right) \right] p\!\left(y^{(i)} \,\middle|\, \boldsymbol{\pi}\right) \tag{9}$$
In Equation (4), the distribution of $x_j^{(i)}$ is only defined when conditioned upon a particular class value $y^{(i)} = k$. We can make the class explicit by writing the term as a product over all classes, where the indicator $\mathbb{1}\!\left[y^{(i)} = k\right]$ picks out the class that was actually observed (and, similarly, $p(y^{(i)} \mid \boldsymbol{\pi}) = \prod_{k=1}^{K} \pi_k^{\mathbb{1}[y^{(i)} = k]}$):
$$p\!\left(x_j^{(i)} \,\middle|\, y^{(i)}, \boldsymbol{\theta}\right) = \prod_{k=1}^{K} p\!\left(x_j^{(i)} \,\middle|\, y^{(i)} = k, \theta_{jk}\right)^{\mathbb{1}\left[y^{(i)} = k\right]} \tag{10}$$
Replacing the distributions defined earlier:
$$p\!\left(\mathbf{x}^{(i)}, y^{(i)} \,\middle|\, \boldsymbol{\lambda}\right) = \prod_{k=1}^{K} \left[ \pi_k \prod_{j=1}^{d} \theta_{jk}^{x_j^{(i)}} \left(1 - \theta_{jk}\right)^{1 - x_j^{(i)}} \right]^{\mathbb{1}\left[y^{(i)} = k\right]} \tag{11}$$
Equation (11) is the likelihood of a single observation. The likelihood for the entirety of the data is as follows:
$$p(\mathcal{D} \mid \boldsymbol{\lambda}) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \prod_{j=1}^{d} \theta_{jk}^{x_j^{(i)}} \left(1 - \theta_{jk}\right)^{1 - x_j^{(i)}} \right]^{\mathbb{1}\left[y^{(i)} = k\right]} = \prod_{k=1}^{K} \prod_{i : y^{(i)} = k} \pi_k \prod_{j=1}^{d} \theta_{jk}^{x_j^{(i)}} \left(1 - \theta_{jk}\right)^{1 - x_j^{(i)}} \tag{12}$$
Notice how, in the last equality, the product over all observations is limited to a product over only the observations belonging to that particular class.
MLE
To find the maximum likelihood estimate we maximise Equation (12) with respect to $\boldsymbol{\theta}$ and $\boldsymbol{\pi}$, subject to the constraint in Equation (3). If we try to find the maximum of Equation (12) directly, we will have a hard time with the derivatives. However, if we apply the $\log$ function to Equation (12), this will greatly simplify the calculus without changing the location of the maximum, since the logarithm is monotonically increasing:
$$\hat{\boldsymbol{\lambda}} = \operatorname*{arg\,max}_{\boldsymbol{\lambda}}\; p(\mathcal{D} \mid \boldsymbol{\lambda}) = \operatorname*{arg\,max}_{\boldsymbol{\lambda}}\; \log p(\mathcal{D} \mid \boldsymbol{\lambda}) \tag{13}$$
We go ahead and take the log:
$$\log p(\mathcal{D} \mid \boldsymbol{\lambda}) = \sum_{k=1}^{K} \sum_{i : y^{(i)} = k} \left[ \log \pi_k + \sum_{j=1}^{d} \left( x_j^{(i)} \log \theta_{jk} + \left(1 - x_j^{(i)}\right) \log\!\left(1 - \theta_{jk}\right) \right) \right] \tag{14}$$
Noting that $N_{jk} = \sum_{i : y^{(i)} = k} x_j^{(i)}$ is the total number of 1s in dimension $j$ for class $k$ (or equivalently the sum of column $j$ over the rows belonging to class $k$), and also that $N_k = \sum_{i=1}^{N} \mathbb{1}\!\left[y^{(i)} = k\right]$ is the number of observations in class $k$, we have:
$$\log p(\mathcal{D} \mid \boldsymbol{\lambda}) = \sum_{k=1}^{K} \left[ N_k \log \pi_k + \sum_{j=1}^{d} \left( N_{jk} \log \theta_{jk} + \left(N_k - N_{jk}\right) \log\!\left(1 - \theta_{jk}\right) \right) \right] \tag{15}$$
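As a sanity check, the count statistics and the log-likelihood of Equation (15) can be computed directly. The sketch below assumes the data are stored as a hypothetical `(N, d)` binary array `x` and a length-`N` label array `y` with classes `0, ..., K-1`:

```python
import numpy as np

def log_likelihood(x, y, pi, theta, K):
    """Log-likelihood of Equation (15) for binary features.

    x: (N, d) 0/1 array, y: (N,) labels in {0, ..., K-1},
    pi: (K,) class probabilities, theta: (d, K) Bernoulli parameters.
    """
    N_k = np.array([(y == k).sum() for k in range(K)])                   # observations per class
    N_jk = np.stack([x[y == k].sum(axis=0) for k in range(K)], axis=1)   # ones per feature j and class k
    return float(
        (N_k * np.log(pi)).sum()
        + (N_jk * np.log(theta)).sum()
        + ((N_k[None, :] - N_jk) * np.log(1.0 - theta)).sum()
    )
```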
We incorporate the constraint on $\boldsymbol{\pi}$ (Equation (3)) by adding a Lagrange-multiplier term with multiplier $\eta$:
$$\mathcal{L}(\boldsymbol{\pi}, \boldsymbol{\theta}, \eta) = \sum_{k=1}^{K} \left[ N_k \log \pi_k + \sum_{j=1}^{d} \left( N_{jk} \log \theta_{jk} + \left(N_k - N_{jk}\right) \log\!\left(1 - \theta_{jk}\right) \right) \right] + \eta \left( 1 - \sum_{k=1}^{K} \pi_k \right) \tag{16}$$
Taking the derivative of Equation (16) with respect to $\eta$ and setting it to zero recovers the constraint:
$$\frac{\partial \mathcal{L}}{\partial \eta} = 1 - \sum_{k=1}^{K} \pi_k = 0 \tag{17}$$
Next we take the derivative with respect to $\pi_k$ and set it to zero:
$$\frac{\partial \mathcal{L}}{\partial \pi_k} = \frac{N_k}{\pi_k} - \eta = 0 \quad\Longrightarrow\quad \pi_k = \frac{N_k}{\eta} \tag{18}$$
Summing Equation (18) over $k$ and using the constraint in Equation (17):
$$1 = \sum_{k=1}^{K} \pi_k = \frac{1}{\eta} \sum_{k=1}^{K} N_k = \frac{N}{\eta} \quad\Longrightarrow\quad \eta = N \tag{19}$$
Putting Equation (19) back into Equation (18), we obtain a solution for $\pi_k$ that satisfies the constraint:
$$\hat{\pi}_k = \frac{N_k}{N} \tag{20}$$
Similarly, we take the derivative with respect to $\theta_{jk}$ and set it to zero:
$$\frac{\partial \mathcal{L}}{\partial \theta_{jk}} = \frac{N_{jk}}{\theta_{jk}} - \frac{N_k - N_{jk}}{1 - \theta_{jk}} = 0 \tag{21}$$
Solving for $\theta_{jk}$:
$$\hat{\theta}_{jk} = \frac{N_{jk}}{N_k} \tag{22}$$
We have the following MLE solutions:
$$\hat{\pi}_k = \frac{N_k}{N}, \qquad \hat{\theta}_{jk} = \frac{N_{jk}}{N_k} \tag{23}$$
where
$$N_k = \sum_{i=1}^{N} \mathbb{1}\!\left[y^{(i)} = k\right], \qquad N_{jk} = \sum_{i : y^{(i)} = k} x_j^{(i)}, \qquad N = \sum_{k=1}^{K} N_k \tag{24}$$
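A minimal sketch of the MLE solutions in Equations (23) and (24), again assuming hypothetical arrays `x` (binary, shape `(N, d)`) and `y` (labels in `0, ..., K-1`):

```python
import numpy as np

def fit_mle(x, y, K):
    """MLE estimates of Equations (23)/(24); assumes every class appears in y."""
    N, d = x.shape
    N_k = np.array([(y == k).sum() for k in range(K)])                   # Eq. (24)
    N_jk = np.stack([x[y == k].sum(axis=0) for k in range(K)], axis=1)   # Eq. (24), shape (d, K)
    return N_k / N, N_jk / N_k                                           # pi_hat (K,), theta_hat (d, K)
```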
Classifying with Naive Bayes
Using the above, given a new data point $\tilde{\mathbf{x}} = (\tilde{x}_1, \ldots, \tilde{x}_d)$, the probability that its class $\tilde{y}$ equals $k$ is then:
$$p\!\left(\tilde{y} = k \,\middle|\, \tilde{\mathbf{x}}, \hat{\boldsymbol{\lambda}}\right) = \frac{\hat{\pi}_k \prod_{j=1}^{d} \hat{\theta}_{jk}^{\tilde{x}_j} \left(1 - \hat{\theta}_{jk}\right)^{1 - \tilde{x}_j}}{\sum_{k'=1}^{K} \hat{\pi}_{k'} \prod_{j=1}^{d} \hat{\theta}_{jk'}^{\tilde{x}_j} \left(1 - \hat{\theta}_{jk'}\right)^{1 - \tilde{x}_j}} \tag{25}$$
which is just Equation (11) with the MLE solutions substituted in, for a single observation and for class $k$ only. We can then find the class which has the maximum probability (notice that we don’t actually have to compute the denominator in Equation (25), since it is the same for each $k$):
$$\hat{\tilde{y}} = \operatorname*{arg\,max}_{k}\; \hat{\pi}_k \prod_{j=1}^{d} \hat{\theta}_{jk}^{\tilde{x}_j} \left(1 - \hat{\theta}_{jk}\right)^{1 - \tilde{x}_j} \tag{26}$$
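Equations (25) and (26) translate almost line-for-line into code. The sketch below scores a new binary observation against each class and returns the argmax, reusing the (hypothetical) `pi_hat` and `theta_hat` arrays produced by the `fit_mle` sketch above:

```python
import numpy as np

def predict(x_new, pi_hat, theta_hat):
    """Class scores for a length-d 0/1 vector x_new (Equations (25)/(26))."""
    x_new = np.asarray(x_new)
    # Unnormalised numerator of Eq. (25), one score per class
    scores = pi_hat * np.prod(
        theta_hat ** x_new[:, None] * (1.0 - theta_hat) ** (1 - x_new[:, None]),
        axis=0,
    )
    return scores.argmax(), scores / scores.sum()   # Eq. (26) and normalised Eq. (25)
```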
Let’s test it out on Table 1 using only the first 3 columns ($d = 3$). We first calculate the MLE solutions (using equations (23) and (24)):
(27)
Plugging these into Equation (26) we find $\hat{\tilde{y}} = 0$, since:
(28)
This makes sense since, according to the data, a data point like $\tilde{\mathbf{x}}$ is most likely going to be of class 0. There is still some small chance it would be class 2, because each of its feature values has occurred for class 2, albeit not simultaneously. However, one of its feature values has never occurred for class 1, hence class 1 should be given the lowest probability. An important note about Equation (28) is that the probabilities may not sum to 1. This is because the denominator in Equation (25) has been omitted, since it is the same for every class.
Bayesian Naive Bayes
The problem with the example above is that class 1 is given a zero probability. This represents a problem which would only become worse in higher dimensions. In Equation (27), one of the estimates $\hat{\theta}_{j1}$ takes on the value 0, since there has never been a case where that feature takes the observed value within class 1. This is very likely to happen in higher dimensions, where it will be common for a few columns not to align with the new observation. We would still want to obtain a non-zero probability for class 1. To do this we approach Naive Bayes from a Bayesian point of view, providing prior distributions for $\boldsymbol{\pi}$ and $\boldsymbol{\theta}$.
Let’s look at the general form of Equation (28) to see how we would classify in a Bayesian setting, i.e. infer the class $\tilde{y}$, given a new observation $\tilde{\mathbf{x}}$:
$$p(\tilde{y} = k \mid \tilde{\mathbf{x}}, \mathcal{D}) = \frac{p(\tilde{\mathbf{x}} \mid \tilde{y} = k, \mathcal{D})\, p(\tilde{y} = k \mid \mathcal{D})}{p(\tilde{\mathbf{x}} \mid \mathcal{D})} = \frac{\left[ \prod_{j=1}^{d} p(\tilde{x}_j \mid \tilde{y} = k, \mathcal{D}) \right] p(\tilde{y} = k \mid \mathcal{D})}{p(\tilde{\mathbf{x}} \mid \mathcal{D})} \tag{29}$$
Note that the last equality in Equation (29) makes the assumption that the random variables of different dimensions are independent of each other. We know from our previous assumption (Equation (4)) that $p(\tilde{x}_j \mid \tilde{y} = k, \theta_{jk})$ is a Bernoulli distribution with parameter $\theta_{jk}$. So we will make these Bernoulli parameters ($\theta_{jk}$) explicit in the first term of the numerator of Equation (29) via the theorem of total probability:
$$p(\tilde{x}_j \mid \tilde{y} = k, \mathcal{D}) = \int p(\tilde{x}_j, \theta_{jk} \mid \tilde{y} = k, \mathcal{D})\, \mathrm{d}\theta_{jk} \tag{30}$$
Applying the definition of conditional probability:
$$p(\tilde{x}_j \mid \tilde{y} = k, \mathcal{D}) = \int p(\tilde{x}_j \mid \theta_{jk}, \tilde{y} = k, \mathcal{D})\, p(\theta_{jk} \mid \tilde{y} = k, \mathcal{D})\, \mathrm{d}\theta_{jk} = \int p(\tilde{x}_j \mid \theta_{jk}, \tilde{y} = k)\, p(\theta_{jk} \mid \mathcal{D})\, \mathrm{d}\theta_{jk} \tag{31}$$
Notice that in the last equality of Equation (31) we dropped the dependence of the first term on the data ($\mathcal{D}$), since the new observation is not dependent on the prior dataset once $\theta_{jk}$ is given (similarly, the posterior over $\theta_{jk}$ does not depend on the yet-unobserved label $\tilde{y}$). The second term in the numerator of Equation (29) can be written in the same way:
$$p(\tilde{y} = k \mid \mathcal{D}) = \int p(\tilde{y} = k \mid \boldsymbol{\pi})\, p(\boldsymbol{\pi} \mid \mathcal{D})\, \mathrm{d}\boldsymbol{\pi} \tag{32}$$
In Equation (32) we skipped the intermediate steps (the analogues of equations (30) and (31)) and wrote the result directly. Putting equations (31) and (32) into Equation (29):
$$p(\tilde{y} = k \mid \tilde{\mathbf{x}}, \mathcal{D}) \propto \left[ \prod_{j=1}^{d} \int p(\tilde{x}_j \mid \tilde{y} = k, \theta_{jk})\, p(\theta_{jk} \mid \mathcal{D})\, \mathrm{d}\theta_{jk} \right] \int p(\tilde{y} = k \mid \boldsymbol{\pi})\, p(\boldsymbol{\pi} \mid \mathcal{D})\, \mathrm{d}\boldsymbol{\pi} \tag{33}$$
We removed the denominator in Equation (29) and hence Equation (33) is a proportionality. Equation (33) contains two terms, $p(\theta_{jk} \mid \mathcal{D})$ and $p(\boldsymbol{\pi} \mid \mathcal{D})$, that are posterior probabilities for the parameters. By Bayes’ theorem:
$$p(\boldsymbol{\lambda} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\lambda})\, p(\boldsymbol{\lambda})}{p(\mathcal{D})} \tag{34}$$
where $\boldsymbol{\lambda}$ is as defined in Equation (6). The first term in the numerator of Equation (34) is the likelihood for $\boldsymbol{\lambda}$, while the second term is the prior. The likelihood is given by Equation (12), which we derived during the MLE calculation. To make the multiplication of the likelihood and the prior easier, we can choose a prior distribution for $\boldsymbol{\lambda}$ that is similar in form to the likelihood. Here we will make another assumption: the prior parameters are independent of each other, so that:
$$p(\boldsymbol{\lambda}) = p(\boldsymbol{\pi}) \prod_{k=1}^{K} \prod_{j=1}^{d} p(\theta_{jk}) \tag{35}$$
If we let each $\theta_{jk}$ have a Beta prior,
$$p(\theta_{jk}) = \mathrm{Beta}(\theta_{jk} \mid a, b) \tag{36}$$
so that
$$p(\theta_{jk}) \propto \theta_{jk}^{a - 1} \left(1 - \theta_{jk}\right)^{b - 1} \tag{37}$$
and let $\boldsymbol{\pi}$ have a Dirichlet prior,
$$p(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \alpha_1, \ldots, \alpha_K) \tag{38}$$
so that
$$p(\boldsymbol{\pi}) \propto \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} \tag{39}$$
then Equation (35) becomes
$$p(\boldsymbol{\lambda}) \propto \left[ \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} \right] \prod_{k=1}^{K} \prod_{j=1}^{d} \theta_{jk}^{a - 1} \left(1 - \theta_{jk}\right)^{b - 1} \tag{40}$$
such that $\theta_{jk} \in [0, 1]$. This support constraint has been moved out of the equation; it just means that $\theta_{jk}$ is between 0 and 1 for any $j$ and $k$ (likewise, the Dirichlet density is only defined for $\boldsymbol{\pi}$ on the probability simplex).
Equation (40) is in the same form as the likelihood: exponentiating Equation (15), the likelihood in Equation (12) can be written as $\prod_{k} \pi_k^{N_k} \prod_{j} \theta_{jk}^{N_{jk}} \left(1 - \theta_{jk}\right)^{N_k - N_{jk}}$. Putting equations (12) and (40) into (34) to get the posterior:
$$p(\boldsymbol{\lambda} \mid \mathcal{D}) = c \left[ \prod_{k=1}^{K} \pi_k^{N_k} \prod_{j=1}^{d} \theta_{jk}^{N_{jk}} \left(1 - \theta_{jk}\right)^{N_k - N_{jk}} \right] \left[ \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} \right] \prod_{k=1}^{K} \prod_{j=1}^{d} \theta_{jk}^{a - 1} \left(1 - \theta_{jk}\right)^{b - 1} \tag{41}$$
where the denominator in Equation (34) has been absorbed into the constant $c$. We are now going to rearrange and collect terms, collapsing the multiplicative terms over each $\pi_k$ and each $\theta_{jk}$, i.e. adding their exponents:
$$p(\boldsymbol{\lambda} \mid \mathcal{D}) = c \prod_{k=1}^{K} \left[ \pi_k^{N_k + \alpha_k - 1} \prod_{j=1}^{d} \theta_{jk}^{N_{jk} + a - 1} \left(1 - \theta_{jk}\right)^{N_k - N_{jk} + b - 1} \right] = \mathrm{Dir}(\boldsymbol{\pi} \mid \alpha_1 + N_1, \ldots, \alpha_K + N_K) \prod_{k=1}^{K} \prod_{j=1}^{d} \mathrm{Beta}(\theta_{jk} \mid a + N_{jk},\; b + N_k - N_{jk}) \tag{42}$$
Two observations inform the last equality in Equation (42): the posterior is a probability distribution which satisfies the axioms of a probability measure, and the penultimate expression in Equation (42) has the same functional form as a product of Beta densities and a Dirichlet density. These factors must therefore indeed be the Beta and Dirichlet distributions, with $c$ and the other constants combining to ensure integration to unity. This is an informal justification, but it suffices for us here.
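In code, the posterior of Equation (42) amounts to nothing more than adding the prior parameters to the counts. A small sketch, assuming the same hypothetical `x` and `y` arrays as before, scalar Beta parameters `a`, `b`, and a symmetric Dirichlet parameter `alpha`:

```python
import numpy as np

def posterior_params(x, y, K, a=1.0, b=1.0, alpha=1.0):
    """Posterior of Equation (42): Dirichlet and per-(j, k) Beta parameters."""
    N_k = np.array([(y == k).sum() for k in range(K)])
    N_jk = np.stack([x[y == k].sum(axis=0) for k in range(K)], axis=1)
    dir_post = alpha + N_k            # Dirichlet(alpha_k + N_k), symmetric prior assumed
    beta_a = a + N_jk                 # Beta(a + N_jk, ...)
    beta_b = b + (N_k - N_jk)         #     (..., b + N_k - N_jk)
    return dir_post, beta_a, beta_b
```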
Putting Equation (42) into Equation (33) gives us a way of making a classification (Bayesian Naive Bayes). Writing $a_{jk}' = a + N_{jk}$, $b_{jk}' = b + N_k - N_{jk}$ and $\boldsymbol{\alpha}' = (\alpha_1 + N_1, \ldots, \alpha_K + N_K)$ for the posterior parameters in Equation (42):
$$\begin{aligned} p(\tilde{y} = k \mid \tilde{\mathbf{x}}, \mathcal{D}) &\propto \left[ \prod_{j=1}^{d} \int p(\tilde{x}_j \mid \tilde{y} = k, \theta_{jk})\, \mathrm{Beta}(\theta_{jk} \mid a_{jk}', b_{jk}')\, \mathrm{d}\theta_{jk} \right] \int p(\tilde{y} = k \mid \boldsymbol{\pi})\, \mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}')\, \mathrm{d}\boldsymbol{\pi} \\ &= \left[ \prod_{j=1}^{d} \int \theta_{jk}^{\tilde{x}_j} \left(1 - \theta_{jk}\right)^{1 - \tilde{x}_j} \mathrm{Beta}(\theta_{jk} \mid a_{jk}', b_{jk}')\, \mathrm{d}\theta_{jk} \right] \int \pi_k\, \mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}')\, \mathrm{d}\boldsymbol{\pi} \\ &= \left[ \prod_{j : \tilde{x}_j = 1} \int \theta_{jk}\, \mathrm{Beta}(\theta_{jk} \mid a_{jk}', b_{jk}')\, \mathrm{d}\theta_{jk} \right] \left[ \prod_{j : \tilde{x}_j = 0} \int \left(1 - \theta_{jk}\right) \mathrm{Beta}(\theta_{jk} \mid a_{jk}', b_{jk}')\, \mathrm{d}\theta_{jk} \right] \int \pi_k\, \mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}')\, \mathrm{d}\boldsymbol{\pi} \end{aligned} \tag{43}$$
In the second line we just replaced the definitions of the Bernoulli and the Categorical distributions. In the last line we split the product into terms with power 1 (those $j$ with $\tilde{x}_j = 1$) and terms with power 0 (those with $\tilde{x}_j = 0$). Notice that the integrals are expectations under the corresponding posterior distributions (i.e. let $\theta$ have density $p(\theta)$; then $\int \theta\, p(\theta)\, \mathrm{d}\theta = \mathbb{E}[\theta]$ and $\int (1 - \theta)\, p(\theta)\, \mathrm{d}\theta = 1 - \mathbb{E}[\theta]$). We can then rewrite Equation (43) as:
$$p(\tilde{y} = k \mid \tilde{\mathbf{x}}, \mathcal{D}) \propto \bar{\pi}_k \prod_{j=1}^{d} \bar{\theta}_{jk}^{\tilde{x}_j} \left(1 - \bar{\theta}_{jk}\right)^{1 - \tilde{x}_j}, \qquad \text{where } \bar{\theta}_{jk} = \mathbb{E}\!\left[\theta_{jk} \mid \mathcal{D}\right] \text{ and } \bar{\pi}_k = \mathbb{E}\!\left[\pi_k \mid \mathcal{D}\right] \tag{44}$$
The mean of a $\mathrm{Beta}(a, b)$ distribution is $\frac{a}{a + b}$ and the mean of the $k$-th component of a $\mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$ distribution is $\frac{\alpha_k}{\sum_{k'} \alpha_{k'}}$. The means in Equation (44) are then:
$$\bar{\theta}_{jk} = \frac{N_{jk} + a}{N_k + a + b}, \qquad \bar{\pi}_k = \frac{N_k + \alpha_k}{N + \sum_{k'=1}^{K} \alpha_{k'}} \tag{45}$$
Typically, for an uninformative prior we set $a = b = 1$ and $\alpha_1 = \cdots = \alpha_K = 1$. These are just the uniform distribution and its higher-dimensional analogue. Notice that Equation (44) is exactly the same as Equation (25), but with the parameter estimates obtained from a Bayesian perspective: they are now means of the corresponding posterior distributions, incorporating the prior weights, and not just maximisers of the likelihood.
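The posterior means of Equation (45) are again just counts plus prior pseudo-counts. A sketch, assuming a symmetric Dirichlet prior (all $\alpha_k$ equal to a scalar `alpha`); with `a = b = alpha = 1` this reduces to the familiar add-one (Laplace) smoothing:

```python
import numpy as np

def fit_bayes(x, y, K, a=1.0, b=1.0, alpha=1.0):
    """Posterior-mean estimates of Equation (45), symmetric Dirichlet prior."""
    N, d = x.shape
    N_k = np.array([(y == k).sum() for k in range(K)])
    N_jk = np.stack([x[y == k].sum(axis=0) for k in range(K)], axis=1)
    pi_bar = (N_k + alpha) / (N + K * alpha)     # Dirichlet posterior mean
    theta_bar = (N_jk + a) / (N_k + a + b)       # Beta posterior mean
    return pi_bar, theta_bar
```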
Classifying with Bayesian Naive Bayes
Now we will generate a new classification for the same observation $\tilde{\mathbf{x}}$. Equation (27) for the parameter estimates becomes:
(46)
The final probabilities for each class are then:
(47)
Just as in Equation (28), class 0 is the most likely class, with class 2 the second most likely. However, this time we predict class 1 with a non-zero probability.
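Finally, here is a self-contained end-to-end sketch on a small made-up binary table (a stand-in only, since Table 1 is not reproduced here; the numbers below are not the ones from the worked example). It illustrates the qualitative point of this section: the MLE estimates can drive a class probability to exactly zero, while the posterior means keep every class probability non-zero:

```python
import numpy as np

# Made-up 6 x 3 binary table with labels in {0, 1, 2} -- NOT Table 1.
x = np.array([[1, 0, 1],
              [1, 1, 1],
              [1, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
y = np.array([0, 0, 0, 1, 2, 2])
K, (N, d) = 3, x.shape
x_new = np.array([1, 0, 1])

N_k = np.array([(y == k).sum() for k in range(K)])                   # Eq. (24)
N_jk = np.stack([x[y == k].sum(axis=0) for k in range(K)], axis=1)   # Eq. (24)

def class_probs(pi, theta):
    scores = pi * np.prod(theta ** x_new[:, None] * (1 - theta) ** (1 - x_new[:, None]), axis=0)
    return scores / scores.sum()

print(class_probs(N_k / N, N_jk / N_k))                          # MLE (Eq. 25): class 1 gets exactly 0
print(class_probs((N_k + 1) / (N + K), (N_jk + 1) / (N_k + 2)))  # posterior means (Eq. 45), a=b=alpha=1: no zeros
```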