
Overfitting.

A failure mode in which a machine learning model learns the specific patterns and noise in its training data so thoroughly that it performs poorly on new, unseen data. Overfitting produces models that appear highly accurate during training but fail to generalize, making them unreliable when deployed on the real-world data that matters most.

Also known as overtraining, model overfitting, memorization

What it is

A working definition of overfitting.

Every training dataset is a sample from a larger distribution of possible examples. A well-generalized model learns the statistical regularities that hold across the full distribution, not just the specific examples in the training sample. An overfitted model goes further: it learns idiosyncratic patterns unique to the training sample that are due to sampling noise rather than genuine signal. The textbook symptom is a large gap between training performance and validation performance: a model that achieves 97% accuracy on training data but only 72% accuracy on a held-out validation set has very likely overfitted to the training set, and the lower figure is the better estimate of how it will perform when deployed on new data.
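The gap described above can be computed directly. The following is a minimal sketch, not a standard API: the function names and the `tolerance` cutoff are illustrative assumptions, and a sensible cutoff depends on the metric and the task.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def generalization_gap(train_score, val_score):
    """Training score minus validation score; larger suggests more overfitting."""
    return train_score - val_score


def looks_overfitted(train_score, val_score, tolerance=0.05):
    # `tolerance` is a hypothetical cutoff; choose one suited to the metric and task.
    return generalization_gap(train_score, val_score) > tolerance


# The 97% / 72% example from the text.
gap = generalization_gap(0.97, 0.72)
print(f"gap = {gap:.2f}, overfitted = {looks_overfitted(0.97, 0.72)}")
```

Reporting both scores and their difference, not the training score alone, is the point of the sketch.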

Overfitting is more likely when the model has high capacity relative to the amount of training data. A neural network with millions of parameters trained on a dataset with hundreds of examples has more than enough capacity to memorize every training example exactly, and gradient descent will do precisely that if not constrained. High-capacity models with insufficient data are particularly prone to overfitting when the training set contains any systematic biases, such as label noise, selection effects, or temporal distribution shifts, because the model will memorize these artifacts as if they were genuine signal.
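Memorization is easy to reproduce on a toy problem. The sketch below swaps the neural network for a 1-nearest-neighbor classifier, a deliberately simple high-capacity model, and trains it on labels that are pure coin flips. All data is synthetic and the names are illustrative assumptions.

```python
import random


def nearest_neighbor_predict(train_x, train_y, x):
    """Return the label of the training point closest to x (1-nearest-neighbor)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def accuracy(train_x, train_y, xs, ys):
    """Accuracy of the 1-nearest-neighbor model on the points (xs, ys)."""
    hits = sum(
        nearest_neighbor_predict(train_x, train_y, x) == y for x, y in zip(xs, ys)
    )
    return hits / len(ys)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Labels are coin flips: there is no signal to learn, only noise.
train_x = [rng.random() for _ in range(200)]
train_y = [rng.randint(0, 1) for _ in range(200)]
val_x = [rng.random() for _ in range(200)]
val_y = [rng.randint(0, 1) for _ in range(200)]

train_acc = accuracy(train_x, train_y, train_x, train_y)
val_acc = accuracy(train_x, train_y, val_x, val_y)
print(f"train accuracy = {train_acc:.2f}, validation accuracy = {val_acc:.2f}")
```

Each training point is its own nearest neighbor, so the model scores perfectly on the training set, while validation accuracy should hover around chance because there was never any signal to generalize.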

The standard countermeasures are regularization, which penalizes model complexity; dropout, which randomly deactivates units during training; early stopping, which halts training when validation performance stops improving; data augmentation, which artificially expands the training set; and collecting more labeled training data. Monitoring the gap between training and validation performance throughout the training process is the empirical diagnostic that tells practitioners when overfitting is occurring and whether the countermeasures are working. Models should always be evaluated on held-out data that was not used in any stage of training, including hyperparameter selection.
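Of the countermeasures listed above, early stopping is the simplest to express in code. This is a minimal sketch assuming validation loss is recorded once per epoch; `patience` and `min_delta` are common names for these settings but are assumptions here, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve by more than `min_delta`
    for `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training used every recorded epoch.
    return len(val_losses) - 1, best_epoch


# Illustrative (made-up) validation losses: improvement stalls after epoch 3.
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.61]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

In a real training loop the model weights from `best_epoch` would typically be kept, since later epochs fit the training data more closely without improving validation loss.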

Why ad agencies care

Why overfitting is the failure mode that most often causes custom-trained marketing AI to disappoint in production.

A working ad agency that fine-tunes AI models on client-specific data for brand voice classification, creative scoring, or audience propensity estimation regularly works with datasets that are small relative to the models’ capacity, creating strong overfitting risk. An agency that reports training accuracy metrics without validation accuracy metrics to clients is not demonstrating model quality; it is reporting how well the model has memorized its training examples. Genuine model quality is measured only on held-out data, and presenting training metrics as evidence of performance is a common mistake that leads to overconfident AI deployments that underperform expectations.

Creative performance prediction models overfit when trained on a single client’s limited campaign history. A creative quality scorer trained on one client’s 200 past campaign creatives with click-through rate labels has far more model capacity than is needed to fit those 200 examples, and training accuracy will quickly reach near-100% as the model memorizes which specific creatives performed well. On a held-out set of 40 new creatives, the model’s accuracy may drop sharply because it has learned the idiosyncrasies of the 200 training examples rather than the generalizable visual and copy features that drive performance. Regularization and cross-validation on small client-specific datasets are essential.
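On a dataset this small, a single split wastes scarce examples, which is why cross-validation helps. The sketch below generates k-fold index splits using only the standard library. `fit_and_score` is a hypothetical callback standing in for whatever model and metric the team uses.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) pairs for shuffled k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_validate(n_samples, fit_and_score, k=5):
    """Average a validation score over k folds.

    `fit_and_score(train_idx, val_idx)` is a hypothetical callback that trains on
    the training indices and returns a score measured on the validation indices.
    """
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores


# With 200 creatives and k=5, each fold holds out 40 creatives.
folds = list(k_fold_indices(200, k=5))
print([len(val) for _, val in folds])
```

Every creative is used for validation exactly once, so the averaged score draws on all 200 examples. The shuffle assumes the examples are exchangeable; it is not suitable when the order in time matters.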

Training-validation splits must reflect the deployment distribution to produce valid overfitting diagnostics. For time-series marketing data, the validation set must consist of data from later time periods than the training set, not randomly sampled from the full dataset. A creative performance model validated on randomly sampled creatives from the same campaigns as the training data may appear well-generalized because similar campaign conditions appear in both splits, while a model validated on creatives from later campaigns correctly tests whether the learned performance signals generalize to new creative contexts.
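A time-ordered split of the kind described above can be sketched as follows. The record layout, the `launch_date` field name, and the example creatives are all assumptions made for illustration.

```python
from datetime import date


def temporal_split(records, cutoff, date_key="launch_date"):
    """Split records so that validation data is strictly later than training data.

    `records` is a list of dicts; `date_key` names the field holding each record's date.
    """
    train = [r for r in records if r[date_key] < cutoff]
    val = [r for r in records if r[date_key] >= cutoff]
    return train, val


# Made-up creatives for illustration only.
creatives = [
    {"id": "a", "launch_date": date(2023, 1, 15)},
    {"id": "b", "launch_date": date(2023, 3, 2)},
    {"id": "c", "launch_date": date(2023, 6, 20)},
    {"id": "d", "launch_date": date(2023, 9, 5)},
]
train, val = temporal_split(creatives, cutoff=date(2023, 6, 1))
print([r["id"] for r in train], [r["id"] for r in val])
```

Splitting on a date cutoff instead of sampling at random keeps every validation creative later than every training creative, which mirrors how the model is used after deployment.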

Overfitting in media mix models produces unreliable channel contribution estimates. A media mix model with too many predictors relative to the number of observations will overfit: the model will find parameter combinations that fit the historical data well but reflect sampling artifacts rather than the true channel effectiveness relationships. The fitted coefficients will be sensitive to which specific weeks are included in the modeling window, changing significantly when a few weeks are added or removed. Cross-validation that tests model predictions on held-out time periods is the appropriate diagnostic for media mix model overfitting, and coefficient stability across different model windows is a practical check on whether the estimates are robust.
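The coefficient stability check can be illustrated with a deliberately simplified sketch: a single channel, an ordinary least-squares slope, and a refit on each contiguous window of weeks. A real media mix model has many predictors; the function names, the `relative_spread` score, and the data are hypothetical.

```python
def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def slopes_across_windows(x, y, window):
    """Refit the slope on every contiguous window of `window` observations."""
    return [
        ols_slope(x[start:start + window], y[start:start + window])
        for start in range(len(x) - window + 1)
    ]


def relative_spread(values):
    """(max - min) / |mean|: a rough, hypothetical instability score."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / abs(mean)


# Made-up weekly spend and sales that follow sales = 50 + 2 * spend exactly.
spend = [10, 12, 9, 15, 11, 14, 13, 16, 10, 12]
sales = [50 + 2 * s for s in spend]
slopes = slopes_across_windows(spend, sales, window=8)
print(slopes, relative_spread(slopes))
```

Because the made-up data is noiseless, the slope is the same in every window and the spread is essentially zero. On real data, a spread that is large relative to the coefficient itself suggests the estimate depends on which weeks happen to be in the window.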

In practice

What overfitting looks like inside a working ad agency.

An agency is training a conversion propensity model for a B2B technology client to score website visitors for ad retargeting. The training dataset has 4,200 labeled visitor sessions: 340 converters (submitted a demo request) and 3,860 non-converters. The team trains an XGBoost gradient boosted tree model and evaluates its performance on a randomly sampled 20% holdout set. Training AUC is 0.94 and validation AUC is 0.81, a gap of 0.13 that indicates moderate overfitting. The team investigates the features driving the training performance and finds that three features account for much of the gap: session ID hash (a proxy for exact visitor identity that has no predictive value on new visitors), exact timestamp features at the minute level (capturing noise patterns in the training data), and URL path combinations at high granularity (fitting specific page sequences from the training set that are unlikely to recur exactly). All three features let the model fit training-specific information that does not generalize. The team removes these features and reduces the maximum tree depth from 8 to 4 levels. The retrained model achieves training AUC of 0.86 and validation AUC of 0.83, a gap of only 0.03, indicating good generalization. Training AUC has dropped from 0.94 to 0.86, but validation AUC has risen from 0.81 to 0.83, and the validation figure is the one that matters: the cleaner model generalizes better to the new visitors it will actually score in production.
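Features such as a session ID hash or a minute-level timestamp share one trait: almost every row has its own value. The sketch below screens for that trait. It is a rough heuristic under stated assumptions, not the team's actual method; the `threshold` cutoff and the example sessions are hypothetical.

```python
def unique_ratio(values):
    """Distinct values divided by row count; 1.0 means every row is unique."""
    return len(set(values)) / len(values)


def flag_identifier_like(columns, threshold=0.9):
    """Return names of columns whose values are almost all distinct.

    `columns` maps a feature name to its list of values. `threshold` is a
    hypothetical cutoff, not a standard. Continuous numeric features will also
    be flagged, so treat the result as a list to review, not to drop blindly.
    """
    return sorted(
        name for name, values in columns.items() if unique_ratio(values) >= threshold
    )


# Made-up sessions for illustration only.
sessions = {
    "session_id_hash": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "visit_minute": ["09:01", "09:17", "10:42", "11:03", "13:58", "14:20"],
    "device": ["mobile", "desktop", "mobile", "mobile", "desktop", "mobile"],
    "pages_viewed": [1, 3, 3, 2, 1, 3],
}
print(flag_identifier_like(sessions))
```

A screen like this only surfaces candidates. Whether a flagged feature is removed, as in the example above, or coarsened (for instance, minute reduced to hour of day) remains a judgment call checked against validation performance.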

Build the model evaluation expertise that distinguishes genuinely well-performing AI from memorized training data through The Creative Cadence Workshop.

The generative AI foundations module covers the bias-variance tradeoff, overfitting diagnostics, and the regularization and evaluation practices that produce machine learning models that perform reliably when deployed on real-world data.