AI Glossary · Letter M

Maximum Likelihood Estimation.

A statistical method for estimating model parameters by finding the values that make the observed data most probable under the model. Maximum likelihood estimation is among the most widely used parameter estimation methods in statistics and machine learning. It underlies logistic regression, linear regression with Gaussian errors, neural network training with likelihood-based losses, and most probabilistic model fitting.

Also known as MLE, likelihood maximization, maximum likelihood

What it is

A working definition of maximum likelihood estimation.

Maximum likelihood estimation asks: given the observed data and a statistical model that specifies how data is generated as a function of parameters, what parameter values make the observed data most probable? The likelihood function is the probability (or, for continuous data, the probability density) of the observed data, viewed as a function of the parameters. MLE finds the parameter values that maximize this likelihood. In practice the logarithm of the likelihood is maximized instead, because it turns products over observations into sums and has the same maximizer. For simple models the maximum can be found analytically using calculus; for most others it is found numerically using optimization algorithms.
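As a minimal sketch of the analytic and numerical routes, the example below estimates a single conversion rate from counts. The data (42 conversions in 1,000 sessions) and the function name are hypothetical, and the grid search stands in for the optimization algorithms a real library would use.

```python
import math

def bernoulli_log_likelihood(p, conversions, sessions):
    """Log-likelihood of conversion rate p given observed conversion counts."""
    return conversions * math.log(p) + (sessions - conversions) * math.log(1 - p)

# Hypothetical data: 42 conversions in 1,000 sessions.
conversions, sessions = 42, 1000

# Analytic MLE: setting the derivative of the log-likelihood to zero gives k / n.
p_analytic = conversions / sessions

# Numerical MLE: evaluate the log-likelihood on a fine grid and keep the best p.
grid = [i / 10000 for i in range(1, 10000)]
p_numeric = max(grid, key=lambda p: bernoulli_log_likelihood(p, conversions, sessions))

print("analytic MLE:", p_analytic)
print("numerical MLE:", p_numeric)
```

Both routes land on the same estimate here because the log-likelihood has a single peak at the observed conversion rate.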

MLE is the estimation method behind most familiar statistical models. In linear regression, ordinary least squares (OLS) minimizes the sum of squared residuals, which is equivalent to maximizing the likelihood under a Gaussian error model. In logistic regression, the cross-entropy loss that is minimized during training is the negative log-likelihood of the observed binary outcomes under the logistic model. Training a neural network by minimizing cross-entropy loss is maximum likelihood estimation of the network’s parameters under the model’s assumed output distribution. This equivalence means that whenever a machine learning model is trained by minimizing a loss that is the negative log-likelihood of a probability model, the result is a maximum likelihood estimate.
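The two equivalences can be written out briefly. The notation here is an assumption introduced for illustration: y_i is the outcome, x_i the predictor vector, beta the coefficient vector, and sigma^2 the error variance.

```latex
% Linear regression with Gaussian errors: maximizing the likelihood over beta
% is the same as minimizing the sum of squared residuals.
\begin{align*}
y_i &= x_i^\top \beta + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \text{ i.i.d.} \\
\log L(\beta, \sigma^2) &= -\frac{n}{2}\log(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - x_i^\top \beta\bigr)^2 \\
\hat\beta_{\mathrm{MLE}} &= \arg\max_{\beta} \log L(\beta, \sigma^2)
  = \arg\min_{\beta} \sum_{i=1}^{n}\bigl(y_i - x_i^\top \beta\bigr)^2
\end{align*}

% Logistic regression: the negative log-likelihood is the cross-entropy loss.
\begin{align*}
p_i &= \frac{1}{1 + e^{-x_i^\top \beta}} \\
-\log L(\beta) &= -\sum_{i=1}^{n}\bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr]
\end{align*}
```

In the Gaussian case the first term of the log-likelihood does not depend on beta, which is why only the squared-residual term matters when choosing the coefficients.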

MLE has desirable statistical properties in large samples, provided the model is correctly specified and standard regularity conditions hold. The estimates are consistent, converging to the true parameter values as the dataset grows, and asymptotically efficient, meaning that their variance approaches the Cramér-Rao lower bound, the smallest variance any unbiased estimator can achieve. These guarantees do not carry over to small samples, where MLE can be biased and can overfit: the parameters are tuned to the specific observed data rather than the underlying true distribution, producing estimates that do not generalize well. Bayesian estimation addresses this by incorporating prior distributions over parameters that regularize the estimates toward plausible values, which is particularly important in marketing applications where sample sizes for specific audience segments or product categories may be small.
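A small sketch of the difference, using a conversion rate with a Beta prior. All numbers are hypothetical, including the Beta(2, 48) prior (prior mean 4%), and the posterior mean is used as one simple Bayesian point estimate rather than the only possible choice.

```python
def mle_rate(conversions, sessions):
    """Maximum likelihood estimate of a conversion rate."""
    return conversions / sessions

def posterior_mean_rate(conversions, sessions, prior_a, prior_b):
    """Posterior mean under a Beta(prior_a, prior_b) prior (conjugate update)."""
    return (conversions + prior_a) / (sessions + prior_a + prior_b)

# Hypothetical prior: Beta(2, 48), centered on a 4% conversion rate.
prior_a, prior_b = 2.0, 48.0
prior_mean = prior_a / (prior_a + prior_b)

# Small segment: 0 conversions in 15 sessions. The MLE is exactly zero,
# which is implausible as a statement about future conversions.
mle_small = mle_rate(0, 15)
bayes_small = posterior_mean_rate(0, 15, prior_a, prior_b)

# Large segment: 500 conversions in 10,000 sessions. The prior barely matters.
mle_large = mle_rate(500, 10000)
bayes_large = posterior_mean_rate(500, 10000, prior_a, prior_b)

print("small segment:", mle_small, round(bayes_small, 4))
print("large segment:", mle_large, round(bayes_large, 4))
```

The prior pulls the small-segment estimate away from zero toward the prior mean, while the large-segment estimate stays close to the MLE, which is the behavior the paragraph above describes.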

Why ad agencies care

Why understanding MLE improves interpretation of model coefficients and uncertainty in marketing analytics.

A working ad agency building or interpreting statistical models for media mix, attribution, or audience analysis benefits from understanding that most model parameters are maximum likelihood estimates with associated uncertainty. The standard errors and confidence intervals reported alongside regression coefficients quantify that uncertainty, reflecting how much the estimate would vary across repeated samples. Models fitted on small samples or highly correlated inputs produce maximum likelihood estimates with wide confidence intervals, and those estimates should not be treated as precise point values when making budget allocation decisions.
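The effect of correlated inputs on standard errors can be sketched with simulated data. Everything below is hypothetical: the channel names, the true coefficients, and the 0.95 correlation. The coefficients are the OLS fit, which is the MLE under Gaussian errors; the error variance uses the usual n - k divisor rather than the MLE's divisor of n.

```python
import numpy as np

def ols_with_standard_errors(X, y):
    """OLS coefficients (the Gaussian MLE) and their standard errors."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    residuals = y - X @ beta
    # Unbiased error-variance estimate (divides by n - k, not n).
    sigma2 = residuals @ residuals / (n - k)
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta, se

rng = np.random.default_rng(0)
n = 52  # one year of weekly data

# Hypothetical standardized spend series for two channels.
tv = rng.normal(size=n)
search_independent = rng.normal(size=n)
search_correlated = 0.95 * tv + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
noise = rng.normal(size=n)

def fit(search):
    """Fit sales on an intercept, tv, and the given search series."""
    X = np.column_stack([np.ones(n), tv, search])
    y = 1.0 + 2.0 * tv + 1.0 * search + noise
    return ols_with_standard_errors(X, y)

beta_independent, se_independent = fit(search_independent)
beta_correlated, se_correlated = fit(search_correlated)

print("tv standard error, independent inputs:", se_independent[1])
print("tv standard error, correlated inputs: ", se_correlated[1])
```

With the same sample size and noise, the standard error on the tv coefficient is noticeably larger when the two channels move together, so the corresponding confidence interval is wider.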

The connection between MLE and loss functions explains why different loss choices produce different model behaviors. Cross-entropy loss is MLE under a Bernoulli or categorical output distribution; mean squared error (MSE) is MLE under a Gaussian output distribution; mean absolute error (MAE) is MLE under a Laplace output distribution. For these losses, choosing a loss function is therefore equivalent to choosing a distributional assumption about the model’s errors. When marketing outcomes have heavy-tailed distributions, such as sales that occasionally spike dramatically, the Gaussian assumption underlying MSE makes the model far more sensitive to those spikes than a model trained with MAE, because squaring a large residual magnifies its influence on the fit. Understanding this connection allows agencies to make principled loss function choices rather than defaulting to MSE everywhere.
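The difference is easiest to see with the simplest possible model, a single constant fitted to a series. The constant that minimizes MSE is the mean, and a constant that minimizes MAE is the median. The weekly sales figures below are hypothetical, with one spike week.

```python
import statistics

# Hypothetical weekly sales with one dramatic spike.
weekly_sales = [100, 104, 98, 101, 97, 103, 99, 400]

def mse(constant):
    """Mean squared error of predicting the same constant every week."""
    return sum((y - constant) ** 2 for y in weekly_sales) / len(weekly_sales)

def mae(constant):
    """Mean absolute error of predicting the same constant every week."""
    return sum(abs(y - constant) for y in weekly_sales) / len(weekly_sales)

# Gaussian MLE of the location parameter: the mean (minimizes MSE).
mse_fit = statistics.mean(weekly_sales)
# Laplace MLE of the location parameter: the median (minimizes MAE).
mae_fit = statistics.median(weekly_sales)

print("MSE-optimal constant (mean):  ", mse_fit)
print("MAE-optimal constant (median):", mae_fit)
```

The single spike pulls the MSE-optimal value well above every ordinary week, while the MAE-optimal value stays near typical sales.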

Overfitting in small-sample marketing models reflects the limits of MLE. A media mix model estimated on 52 weeks of data with 15 predictors has limited degrees of freedom, and MLE will find parameter values that fit the 52 observed data points well even if those values do not reflect the true underlying channel effectiveness. The fitted coefficients will be sensitive to individual outlier weeks and may change substantially if the modeling window is extended or shortened. Bayesian alternatives that regularize the MLE with informative priors, such as the prior knowledge that marketing channels generally have positive ROI above some minimum threshold, produce more stable and generalizable estimates from the same data.
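A rough sketch of regularized estimation on a dataset of this shape follows. The data are simulated and hypothetical, and ridge regression is used as a simpler stand-in for the informative priors described above: with Gaussian errors, a ridge penalty corresponds to a zero-mean Gaussian prior on the coefficients, so the ridge fit is a maximum a posteriori (MAP) estimate. The penalty value of 10 is an arbitrary illustration, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_weeks, n_predictors = 52, 15

# Hypothetical simulated data: standardized channel inputs and weekly sales.
X = rng.normal(size=(n_weeks, n_predictors))
true_beta = rng.uniform(0.0, 1.0, size=n_predictors)
y = X @ true_beta + rng.normal(scale=2.0, size=n_weeks)

def fit_ridge(X, y, penalty):
    """Closed-form ridge estimate; penalty=0 gives the OLS / Gaussian MLE fit."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(k), X.T @ y)

beta_mle = fit_ridge(X, y, penalty=0.0)
beta_map = fit_ridge(X, y, penalty=10.0)

print("MLE coefficient norm:", np.linalg.norm(beta_mle))
print("MAP coefficient norm:", np.linalg.norm(beta_map))
```

The penalized coefficients are shrunk toward zero relative to the MLE. A prior centered on plausible positive channel effects, as the paragraph suggests, would shrink toward those values instead and would typically be fitted with a Bayesian modeling library rather than a closed form.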

Probabilistic model outputs from MLE are probability estimates, not certainties. A logistic regression model trained by MLE tends to produce well-calibrated probabilities when the model is reasonably well specified: among observations predicted at 0.3, roughly 30% should convert. This property follows from the MLE training objective, because the log-likelihood rewards predicted probabilities that match observed frequencies, but it is not guaranteed for every model type. Tree-based models, and neural networks flexible enough to overfit even when trained with cross-entropy, often produce outputs that are not well-calibrated probabilities without additional calibration steps. Agencies that use model outputs as probability estimates for downstream decisions such as bid calculations or resource allocation should verify calibration before relying on these outputs.
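One common way to verify calibration is to bin predictions by decile and compare mean predicted probability with the observed rate in each bin. The sketch below assumes NumPy and uses simulated, hypothetical scores whose outcomes are drawn from the predicted probabilities, so the result is calibrated by construction; the function name is illustrative.

```python
import numpy as np

def calibration_by_decile(predicted, outcomes, n_bins=10):
    """Return (mean predicted, empirical rate, count) for each score bin."""
    predicted = np.asarray(predicted, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    order = np.argsort(predicted)  # sort sessions from lowest to highest score
    rows = []
    for idx in np.array_split(order, n_bins):
        rows.append((predicted[idx].mean(), outcomes[idx].mean(), len(idx)))
    return rows

# Hypothetical simulated scores: outcomes are drawn from the predicted
# probabilities themselves, so this model is calibrated by construction.
rng = np.random.default_rng(42)
predicted = rng.beta(1, 20, size=20000)
outcomes = rng.random(20000) < predicted

table = calibration_by_decile(predicted, outcomes)
for mean_predicted, empirical_rate, count in table:
    print(f"predicted={mean_predicted:.3f}  observed={empirical_rate:.3f}  n={count}")
```

On real model outputs, a bin where the observed rate departs from the mean prediction by more than sampling noise would explain is a sign that the scores should be recalibrated or the model revisited before the probabilities feed into bids.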

In practice

What maximum likelihood estimation looks like inside a working ad agency.

An agency is building an audience scoring model for a travel client to predict which website visitors are likely to book a flight within 7 days of their visit. The training dataset has 85,000 labeled visitor sessions from the past six months, with a positive label rate of 4.2%. The team trains a logistic regression model using MLE, minimizing cross-entropy loss on the training set.

After training, the team evaluates calibration by grouping test-set predictions into deciles by predicted probability and comparing each decile’s mean predicted probability to its empirical conversion rate. The calibration plot shows that the model is well-calibrated for mid-range predictions but underestimates conversion probability for the top decile: the mean predicted probability there is 18%, but the empirical rate is 31%.

The team investigates and finds that the top decile consists largely of visitors who viewed multiple flight routes and used the price calendar feature, a behavioral combination that is underrepresented in the training data because the price calendar was introduced only three months ago. The training set contained too few examples of this high-intent pattern for MLE to estimate its coefficient accurately.

The team retrains the model, weighting recent sessions twice as heavily as older sessions, which increases the representation of price calendar users in the effective training set and produces better-calibrated estimates for the top-decile segment. After the correction, the top-decile empirical conversion rate is 28% against a predicted 25%, a substantial improvement that matters for the bid calculations in the retargeting campaign that uses the scores.
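The reweighting step amounts to maximizing a weighted log-likelihood, in which each session's contribution is multiplied by its weight. The sketch below shows the idea for the simplest case, a single conversion rate, where the weighted MLE has a closed form. The counts are hypothetical and are not the figures from the scenario above; a weighted logistic regression applies the same principle to every coefficient.

```python
def weighted_conversion_rate(groups):
    """Weighted Bernoulli MLE: sum(w * conversions) / sum(w * sessions).

    Each group is a (conversions, sessions, weight) tuple.
    """
    weighted_conversions = sum(w * k for k, n, w in groups)
    weighted_sessions = sum(w * n for k, n, w in groups)
    return weighted_conversions / weighted_sessions

# Hypothetical counts: older sessions convert at 4%, recent sessions at 8%.
older = (40, 1000)
recent = (80, 1000)

# Equal weights reproduce the ordinary MLE over all sessions.
unweighted = weighted_conversion_rate([(*older, 1.0), (*recent, 1.0)])

# Doubling the weight on recent sessions pulls the estimate toward them.
weighted = weighted_conversion_rate([(*older, 1.0), (*recent, 2.0)])

print("unweighted estimate:", unweighted)
print("recency-weighted estimate:", round(weighted, 4))
```

Weighting changes which data the estimate answers to, so it trades some statistical efficiency for relevance to current behavior; that trade is a modeling judgment rather than something the likelihood decides.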

Build the statistical foundations that improve model interpretation and uncertainty quantification through The Creative Cadence Workshop.

The generative AI foundations module covers the probabilistic and statistical methods underlying marketing AI including maximum likelihood estimation, calibration, and the Bayesian alternatives that address its limitations in small-sample settings.