410.20 b
Description: An overview of Maximum Likelihood Estimation (MLE) in deep learning, highlighting its computational advantages and its application in optimizing loss functions through statistical distance measures.
#400#ML_Engineer_Basic#410#Mathematics#410.20#Statistics#410.20 b#Maximum_Likelihood_Estimation(MLE)_and_Deep_Learning
#Maximum_Likelihood_Estimation_MLE_and_Deep_Learning
Maximum Likelihood Estimation (MLE)
Why should we use the log-likelihood instead of the raw likelihood?
- With only a few data points, multiplying the likelihoods directly is fine, but with hundreds of millions of points the product of that many probabilities underflows and becomes numerically impossible to compute (see the sketch after this list).
- When the data are independent, the likelihood factorizes into a product, and the log turns that product into a sum that computers can evaluate stably.
- With gradient descent, the log also simplifies the gradient: differentiating a product of n terms via the product rule costs roughly O(n^2) work, while differentiating a sum of n log-terms costs O(n).
- In practice, deep-learning loss functions are therefore written as negative log-likelihoods and minimized with gradient descent.
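A minimal numerical sketch of the point above, assuming NumPy/SciPy and a toy standard-normal dataset (all names here are illustrative): the raw product of per-point likelihoods underflows to zero, while the sum of log-likelihoods stays finite.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)          # 100k toy data points

densities = norm.pdf(x)               # per-point likelihoods, each typically < 1

# Direct product of likelihoods underflows to exactly 0.0 in float64.
likelihood = np.prod(densities)
print(likelihood)                     # 0.0  (numerically useless)

# Sum of log-likelihoods stays in a representable range.
log_likelihood = np.sum(norm.logpdf(x))
print(log_likelihood)                 # a finite negative number, around -1.4e5
```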
What do you get if you estimate the parameters of a normally distributed X by maximum likelihood estimation?
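One well-known answer: for a normal sample, the MLE of the mean is the sample mean and the MLE of the variance is the biased (divide-by-n) sample variance. A small sketch, assuming NumPy, SciPy, and toy data, that checks the closed forms against a direct numerical minimization of the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)   # toy data, true mu=3, sigma=2

# Closed-form MLEs for a normal: sample mean and biased sample variance.
mu_hat = x.mean()
sigma_hat = np.sqrt(((x - mu_hat) ** 2).mean())   # note: divide by n, not n-1

# Numerical check: minimize the negative log-likelihood directly.
def nll(params):
    mu, log_sigma = params
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(nll, x0=[0.0, 0.0])
print(mu_hat, sigma_hat)                          # ~3.0, ~2.0
print(res.x[0], np.exp(res.x[1]))                 # should match the closed forms
```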
The sample distribution (the empirical distribution of the values within one sample) and the sampling distribution (the distribution of a statistic, such as the mean, across repeated samples) are very different things.
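A short illustration of the distinction, assuming NumPy and a toy exponential population: the sample distribution is the spread of values within one sample, while the sampling distribution is the spread of a statistic (here the sample mean) across many repeated samples.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample distribution: the empirical distribution of one sample of size 500.
one_sample = rng.exponential(scale=1.0, size=500)
print(one_sample.std())        # close to the population std (1.0 for Exp(1))

# Sampling distribution: the distribution of the sample mean across 10,000
# repeated samples of size 500; its spread shrinks like 1/sqrt(n).
means = rng.exponential(scale=1.0, size=(10_000, 500)).mean(axis=1)
print(means.std())             # close to 1.0 / sqrt(500) ≈ 0.045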
Maximum Likelihood Estimation in Deep Learning
Loss functions are derived from a distance between the probability distribution learned by the model and the empirical distribution observed in the data; common choices include the measures below (see the sketch after the list).
- Total Variation Distance
- Kullback-Leibler divergence (KL divergence)
  - Kullback-Leibler divergence (KL divergence) is an asymmetric measure of the difference between two probability distributions. It quantifies how well a probability distribution P can be approximated by another distribution Q. In information theory it is interpreted as the amount of information lost when Q is used in place of P, i.e., as a measure of how well Q represents P.
- Wasserstein Distance
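A minimal sketch of these ideas, assuming NumPy and two hypothetical discrete distributions p ("data") and q ("model"): it shows that KL divergence is not symmetric and that, for a fixed data distribution p, minimizing the cross-entropy (the usual negative log-likelihood loss) is equivalent to minimizing KL(p || q).

```python
import numpy as np

# Hypothetical discrete distributions over 4 outcomes.
p = np.array([0.1, 0.4, 0.4, 0.1])       # "data" distribution
q = np.array([0.25, 0.25, 0.25, 0.25])   # "model" distribution

# KL(p || q): expected information lost when q is used in place of p.
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))
print(kl_pq, kl_qp)                      # different values: KL is not symmetric

# Cross-entropy H(p, q) = H(p) + KL(p || q), so for a fixed data distribution p,
# minimizing cross-entropy (the usual NLL loss) minimizes KL(p || q).
entropy_p = -np.sum(p * np.log(p))
cross_entropy = -np.sum(p * np.log(q))
print(np.isclose(cross_entropy, entropy_p + kl_pq))   # True
```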