Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) in Machine Learning

Roger Yong
6 min read · Jul 22, 2021


Frequentists advocate Maximum Likelihood Estimation (MLE), which is equivalent to minimizing the Cross Entropy (or KL Divergence) between the data and the model.
Bayesians advocate Maximum A Posteriori (MAP) estimation, which is equivalent to maximizing the Likelihood while considering a Regularization Term at the same time. We will also see how the L1 and L2 Regularization Terms are derived in this article.

First of all, we must review some basic rules of probability.

* Marginal probability: P(X) = Σ_Y P(X, Y)

* Conditional probability: P(Y|X) = P(X, Y) / P(X)

* Chain rule of conditional probability: P(X, Y) = P(Y|X) P(X)

* Independence (assuming X and Y are independent): P(X, Y) = P(X) P(Y)
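
These rules are easy to check numerically. Here is a minimal NumPy sketch (the joint table below is made up purely for illustration) that verifies each rule on a small discrete distribution:

```python
import numpy as np

# A made-up joint distribution P(X, Y) with X in {0, 1} and Y in {0, 1, 2}.
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])   # entries sum to 1

# Marginal probability: P(X) = sum over Y of P(X, Y)
P_x = P_xy.sum(axis=1)                   # -> [0.40, 0.60]
P_y = P_xy.sum(axis=0)

# Conditional probability: P(Y|X) = P(X, Y) / P(X)
P_y_given_x = P_xy / P_x[:, None]

# Chain rule: P(X, Y) = P(Y|X) * P(X)
assert np.allclose(P_y_given_x * P_x[:, None], P_xy)

# Independence would require P(X, Y) = P(X) * P(Y); this table is not independent.
print(np.allclose(P_xy, np.outer(P_x, P_y)))  # False
```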

Frequentists vs. Bayesians

Frequentists believe that the world is stable and that there is a fixed population behind all phenomena. Therefore, as long as we run a large number of random, repeatable experiments on that population, we can estimate various statistics through expected values.

Bayesians believe that for some problems there is no stable population, so you cannot simply collect a mass of repeated measurements in a short period of time. Because the system is constantly evolving, we need a statistical model with evolutionary characteristics, that is, Bayesian probability. Starting from an existing prior probability, we continuously update this probability as new evidence is observed.

In machine learning, the advantage of the frequentist view is that it makes no additional assumptions. In contrast, Bayesians need to assume a prior probability, and a wrong prior may mislead the model, causing it to ignore or even distort the information in the data. On the other hand, the advantage of the Bayesian view is that, thanks to the prior, it is less prone to errors when the amount of data is small. If you take the frequentist view on a small sample, your estimates may be inaccurate due to insufficient statistics.

Maximum Likelihood Estimation (MLE)

We are given a dataset D, which contains X and Y, and we have determined the hyperparameters m of the model. What we are looking for are the weights θ of the model.

Frequentists believe that once the model has been determined (fixing m and θ), we can calculate the probability that the dataset D is generated by the model; this is the Likelihood P(D|m,θ). We want this Likelihood to be as high as possible, which is Maximum Likelihood.

Assuming the sampled data are i.i.d. (independent and identically distributed):

P(D|m,θ) = Π_i P(d_i|m,θ)

Next, for convenience of calculation, we take the ln:

ln P(D|m,θ) = Σ_i ln P(d_i|m,θ)

Then, consider the case of supervised learning, where each piece of data d_i corresponds to a pair (x_i, y_i):

ln P(D|m,θ) = Σ_i ln P(x_i, y_i|m,θ) = Σ_i ln P(y_i|x_i, m, θ) + Σ_i ln P(x_i)

where P(x_i) has nothing to do with θ, so:

θ* = argmax_θ Σ_i ln P(y_i|x_i, m, θ)

Multiplying the above expression by minus one and dividing by the number of data N:

θ* = argmin_θ −(1/N) Σ_i ln P_model(d_i|θ) = argmin_θ −Σ_x P_data(x) ln P_model(x|θ) = argmin_θ H(P_data, P_model)

where P_data is the empirical distribution of the dataset.

This means that when we do Maximum Likelihood, we are also minimizing the Cross Entropy between data and model.
Of course, we can also say that Maximum Likelihood is minimizing the KL Divergence:

D_KL(P_data ‖ P_model) = Σ_x P_data(x) ln P_data(x) − Σ_x P_data(x) ln P_model(x|θ)

But the first term, Σ_x P_data(x) ln P_data(x), has nothing to do with θ, so what is actually optimized is still only the Cross Entropy in the second term.
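
As a sanity check, here is a small NumPy sketch (toy categorical data with a made-up "true" distribution) showing that the average negative log-likelihood of a model is exactly the Cross Entropy between the empirical distribution and the model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 10,000 categorical samples over 3 classes (made-up probabilities).
true_p = np.array([0.2, 0.5, 0.3])
data = rng.choice(3, size=10_000, p=true_p)

# Empirical distribution P_data of the sample.
p_data = np.bincount(data, minlength=3) / len(data)

def avg_nll(theta):
    """Average negative log-likelihood of the data under model P_model = theta."""
    return -np.mean(np.log(theta[data]))

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) ln q(x)."""
    return -np.sum(p * np.log(q))

theta = np.array([1/3, 1/3, 1/3])                     # an arbitrary candidate model
print(avg_nll(theta), cross_entropy(p_data, theta))   # identical values

# For a categorical model, the MLE is the empirical distribution itself:
# exactly the theta that minimizes H(P_data, P_model).
print(p_data)
```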

Maximum A Posteriori (MAP)

MLE treats all θ as equally probable, and the optimization goal of MLE is P(D|m,θ). In that formula, a set of θ is directly given, so where this θ comes from and how probable it is are not considered by MLE. In MAP, the probability of θ occurring is modeled explicitly. And when you optimize the MAP objective, a Regularization Term is derived at the same time.

First, let’s derive Bayes’ theorem. By the chain rule of conditional probability:

P(θ, D|m) = P(D|m, θ) P(θ|m) = P(θ|D, m) P(D|m)

Because m is given throughout, it simply stays behind the conditioning bar in every term. Dividing both sides by P(D|m):

P(θ|D, m) = P(D|m, θ) P(θ|m) / P(D|m)

So we get Bayes’ theorem, where:

  • P(D|m,θ) is the Likelihood.
  • P(θ|m) is the prior probability. It describes the probability that the parameters θ appear, given m. This term is prior to experience; the “experience” here refers to this round of the experiment, and P(θ|m) has nothing to do with this round of the experiment. Therefore, P(θ|m) must either be given or be calculated from past historical records.
  • P(D|m) is the probability of the data. Usually the data and the hyperparameters of the model should be independent, so it can be abbreviated as P(D), which can easily be estimated from the dataset.
  • P(θ|D,m) is called the posterior probability: the probability that θ occurs given the dataset D and hyperparameters m. This probability becomes the prior probability for the next round of experiments; this is the evolutionary concept of Bayesianism.

The posterior probability is the quantity we are going to maximize, so this method is called Maximum A Posteriori (MAP).
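
To make the evolutionary idea concrete, that the posterior of one round becomes the prior of the next, here is a minimal sketch using a Beta prior on a coin's bias (the Beta distribution is conjugate to the Bernoulli likelihood; the data and the true bias 0.7 are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Beta(a, b) prior on the coin bias theta. Conjugacy means the posterior
# after observing heads/tails is simply Beta(a + heads, b + tails).
a, b = 1.0, 1.0   # start from a uniform prior

for day in range(3):
    flips = rng.binomial(1, 0.7, size=20)        # today's new evidence
    heads = flips.sum()
    a, b = a + heads, b + (len(flips) - heads)   # posterior update...
    # ...and this posterior is the prior for tomorrow's flips.
    map_theta = (a - 1) / (a + b - 2)            # mode of Beta(a, b) = MAP estimate
    print(f"day {day}: MAP estimate of theta = {map_theta:.3f}")
```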

Assuming the sampled data are i.i.d. (independent and identically distributed):

P(θ|D, m) = [Π_i P(d_i|m, θ)] P(θ|m) / Π_i P(d_i|m)

Next, for convenience of calculation, we take the ln:

ln P(θ|D, m) = Σ_i ln P(d_i|m, θ) + ln P(θ|m) − Σ_i ln P(d_i|m)

where ln P(d_i|m) has nothing to do with θ and can be ignored:

θ* = argmax_θ [ Σ_i ln P(d_i|m, θ) + ln P(θ|m) ]

The first term is the same as in MLE, so for supervised learning it again reduces to the sum of log-likelihoods:

θ* = argmax_θ [ Σ_i ln P(y_i|x_i, m, θ) + ln P(θ|m) ]

Assuming that P(θ|m) is a Uniform Distribution, that is, all θ have the same probability, then P(θ|m) = const. This term can then be ignored because it has nothing to do with θ, and in this case MAP = MLE.
Therefore, under the Bayesian view, MLE is just a special case of MAP: MLE is MAP under the assumption that every θ is equally probable.

But based on the above reasoning, we would rather put more probability near θ = 0 than give all θ the same probability.
So we want the distribution P(θ|m) to have finite variance and a mean close to 0. There are two natural choices here: the Normal Distribution and the Laplace Distribution.

Normal Distribution

Suppose P(θ|m) is a Normal Distribution with mean 0:

P(θ|m) = (1/√(2πσ²)) exp(−θ²/(2σ²))

ln P(θ|m) = −θ²/(2σ²) + const

θ* = argmax_θ [ Σ_i ln P(y_i|x_i, m, θ) − (1/(2σ²)) ‖θ‖² ]

In this way, the L2 Regularization Term is derived, with regularization strength λ = 1/(2σ²).
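
For a concrete instance, with a linear model and Gaussian noise this MAP estimate has a closed form: it is exactly ridge (L2-regularized) regression, with λ fixed by the noise and prior variances. A minimal sketch, with all numbers made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear data: y = X w_true + Gaussian noise.
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=100)

sigma_noise = 0.5   # assumed std of the observation noise
sigma_prior = 1.0   # std of the Normal prior on each weight

# Negative log-posterior (up to constants):
#   ||y - X w||^2 / (2 sigma_noise^2) + ||w||^2 / (2 sigma_prior^2)
# i.e. squared error plus an L2 penalty with lambda = sigma_noise^2 / sigma_prior^2.
lam = sigma_noise**2 / sigma_prior**2

# The minimizer is the ridge-regression solution.
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_map)   # close to w_true
```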

Laplace Distribution

Suppose P(θ|m) is a Laplace Distribution with mean 0:

P(θ|m) = (1/(2b)) exp(−|θ|/b)

ln P(θ|m) = −|θ|/b + const

θ* = argmax_θ [ Σ_i ln P(y_i|x_i, m, θ) − (1/b) ‖θ‖₁ ]

The second term is the L1 Regularization Term, with regularization strength λ = 1/b.
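
Unlike the Gaussian case, the Laplace-prior MAP estimate has no general closed form, but its characteristic effect, sparsity, is easy to see. A small sketch, assuming an orthonormal design so that the L1-penalized solution reduces to soft-thresholding the least-squares weights:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |.|, the building block of L1 (lasso) solutions."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# With a Laplace prior, MAP adds lam * ||w||_1 to the squared error. For an
# orthonormal design (X^T X = I), the minimizer is the soft-thresholded
# least-squares estimate: small weights are set exactly to zero.
w_ols = np.array([1.2, -0.3, 0.05, -2.0])   # made-up least-squares weights
lam = 0.5
print(soft_threshold(w_ols, lam))           # -> [ 0.7  -0.    0.   -1.5]
```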

Conclusion

  • MAP directly uses Bayes’ theorem to find the θ with the greatest probability of occurring under the given conditions.
  • MLE indirectly finds a set of θ that makes the probability of the model generating the data as high as possible.
