A data preprocessing technique that increases the number of examples in underrepresented classes in an imbalanced training dataset, either by duplicating existing minority-class examples or by generating synthetic new examples. Oversampling is used when the class of interest, such as converters in a conversion prediction model, is rare relative to the majority class, causing standard training to produce models that ignore the minority class.
Also known as class oversampling or minority class augmentation. SMOTE is a widely used synthetic variant of oversampling rather than a synonym for it.
Class imbalance occurs when one class in a training dataset is much more frequent than another. In digital advertising conversion prediction, the conversion rate is typically 1 to 5%, producing a dataset where 95 to 99% of examples are non-converters and only 1 to 5% are converters. A model trained on such data without any adjustment will learn that predicting “non-converter” for every example produces very high accuracy, because accuracy is dominated by the majority class. The model appears highly accurate but is useless for identifying converters.
Random oversampling addresses this by duplicating randomly selected minority-class examples until the class balance improves, typically to a minority-to-majority ratio of 1:1 or 1:3. SMOTE (Synthetic Minority Over-sampling Technique) instead generates synthetic minority-class examples by interpolation: for each selected minority example, it picks one of that example’s k nearest minority-class neighbors at random and creates a new example at a random point on the line segment connecting the two in feature space. SMOTE’s synthetic examples are more diverse than duplicates, which reduces overfitting to specific training examples, but some of them may fall in feature regions between minority examples that do not correspond to real-world behavior.
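The two approaches described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the reference SMOTE implementation: the function names `random_oversample` and `smote_like`, the toy feature rows, and the choice of `k=2` are all assumptions made for the example.

```python
import math
import random

def random_oversample(minority, n_new, rng):
    """Return n_new copies drawn with replacement from the minority rows."""
    return [list(rng.choice(minority)) for _ in range(n_new)]

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic rows by interpolating from a random minority
    row toward one of its k nearest minority neighbors (simplified sketch)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

rng = random.Random(0)
# Hypothetical minority-class rows with two numeric features each.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
duplicates = random_oversample(minority, 6, rng)
synthetic = smote_like(minority, 6, k=2, rng=rng)
```

Every row in `duplicates` is an exact copy of an original row, while every row in `synthetic` lies on a segment between two original rows. That difference is the source of both SMOTE's extra diversity and its risk of producing points that no real example would occupy.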
Oversampling is often combined with undersampling, which reduces the number of majority-class examples, to achieve the desired class balance at a manageable total training set size. Cost-sensitive learning is an alternative that does not change the data distribution but instead weights minority-class examples more heavily in the loss function during training, achieving a similar effect to oversampling without creating duplicate or synthetic examples. Class weighting is simpler to implement than oversampling and avoids the potential for synthetic examples to introduce artifacts, making it the preferred approach for many practitioners.
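To make the link between class weighting and oversampling concrete, the sketch below computes a class-weighted binary log loss and compares it with the unweighted loss on a dataset where the minority example has been duplicated. The function `weighted_log_loss`, the toy labels, and the constant score of 0.05 are illustrative assumptions, not values from any real model.

```python
import math

def weighted_log_loss(y_true, p_pred, w_pos=1.0, w_neg=1.0):
    """Weighted mean binary cross-entropy with one weight per class."""
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(y_true, p_pred):
        w = w_pos if y == 1 else w_neg
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += w * loss
        weight_sum += w
    return total / weight_sum

# Toy data: 1 converter among 10 examples, scored low by a lazy model.
y_true = [1] + [0] * 9
p_pred = [0.05] * 10

unweighted = weighted_log_loss(y_true, p_pred)
weighted = weighted_log_loss(y_true, p_pred, w_pos=9.0)

# Duplicating the converter 9 times gives the same loss as weighting it by 9.
duplicated = weighted_log_loss([1] * 9 + [0] * 9, [0.05] * 18)
```

Here `weighted` and `duplicated` agree up to floating-point rounding, and both are much larger than `unweighted`. Either adjustment penalizes the model far more heavily for scoring the lone converter low.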
A working ad agency building any model to predict rare events, including conversions, churns, fraud events, and brand safety violations, will encounter class imbalance as a standard data challenge. Without addressing imbalance, models trained on rare-event datasets default to predicting the majority class and are useless for the classification task they were built for. Understanding the available techniques for handling imbalance, and their specific tradeoffs, determines whether custom models for rare marketing outcomes produce useful predictions or misleadingly high accuracy scores that conceal complete failure on the minority class.
Conversion prediction models for rare conversion events require explicit imbalance handling to produce useful scores. A model trained on a dataset with a 1% conversion rate and no imbalance handling will predict near-zero probability for almost all examples, because the 1% minority class contributes too little to the loss function relative to the 99% majority class. Oversampling the converter class to 10% to 20% of the training set, or applying class weights that make each converter example 10x more influential than a non-converter in the loss function, produces models that learn the distinguishing features of converters and produce a spread of predicted scores across the full 0-1 range.
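As a rough guide to how much oversampling those target shares imply, the following sketch computes the number of converter examples needed to reach a given share of the training set when the majority class is left untouched. The dataset size of 100,000 and the helper `minority_needed` are hypothetical, chosen only to match the 1% conversion rate discussed above.

```python
def minority_needed(n_majority, target_share):
    """Minority count that makes the minority class `target_share` of the set."""
    return round(target_share * n_majority / (1.0 - target_share))

n_total = 100_000               # hypothetical dataset size
n_minority = 1_000              # 1% converters
n_majority = n_total - n_minority

for share in (0.10, 0.20):
    needed = minority_needed(n_majority, share)
    factor = needed / n_minority
    print(f"{share:.0%} share: {needed} converters, about {factor:.1f}x the original")
```

Under these assumptions, a 10% share requires roughly an elevenfold increase in converter examples, which is in line with the 10x class weight mentioned above.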
The class balance at training time affects the calibration of predicted probabilities at prediction time. A model trained with oversampling to a 50/50 class balance will predict probabilities that reflect this artificial balance rather than the true 1% conversion rate. If the model predicts a 40% conversion probability for a visitor, the true probability of conversion given the original 1% base rate is much lower. Correcting for this shift requires re-calibrating the model’s probabilities, for example with Platt scaling or isotonic regression fitted on a held-out validation set that keeps the original class distribution. Class weighting shifts predicted probabilities in much the same way as oversampling, so a class-weighted model needs the same re-calibration before its scores are read as conversion probabilities.
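The size of the shift in the 40% example can be estimated with a simple odds rescaling. Note that this is a different technique from the Platt scaling and isotonic regression named above: it is an analytic prior correction that assumes resampling changed only the class ratio and nothing else about the data. The function name `correct_for_resampling` is an assumption made for this sketch.

```python
def correct_for_resampling(p_model, train_rate, true_rate):
    """Rescale a predicted probability from the training prior to the true prior.

    Assumes resampling changed only the class ratio, not the feature
    distribution within each class.
    """
    model_odds = p_model / (1.0 - p_model)
    train_odds = train_rate / (1.0 - train_rate)
    true_odds = true_rate / (1.0 - true_rate)
    corrected_odds = model_odds * true_odds / train_odds
    return corrected_odds / (1.0 + corrected_odds)

# A 40% score from a model trained at 50/50, mapped back to a 1% base rate.
p = correct_for_resampling(0.40, train_rate=0.50, true_rate=0.01)
print(f"{p:.4f}")
```

Under that assumption, the 40% score corresponds to a conversion probability of about 0.7%. In practice, a calibration step fitted on held-out data is the safer choice, because it does not rely on the assumption holding exactly.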
Evaluation of imbalanced classifiers should rely on precision, recall, and F1 rather than accuracy. A conversion model that predicts 0 for every example achieves 99% accuracy on a dataset with a 1% conversion rate while having a recall of 0, which makes it useless in practice. Precision measures what fraction of predicted conversions are true conversions, recall measures what fraction of all conversions are correctly predicted, and F1 is their harmonic mean. AUC-ROC measures discrimination ability across all possible classification thresholds and does not depend on the class ratio, although it can look optimistic when positives are very rare, so it is best read alongside precision and recall. Agencies evaluating AI vendors who report only accuracy on imbalanced datasets should request precision, recall, and AUC metrics.
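The always-predict-zero case can be checked directly by computing the metrics from confusion-matrix counts. The sketch below is a minimal stdlib implementation; the helper `classification_metrics` and the 1,000-example toy dataset are assumptions for illustration. Precision is undefined when the model makes no positive predictions, and it is reported as 0.0 here by convention.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy dataset with a 1% conversion rate and a model that always predicts 0.
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000
metrics = classification_metrics(y_true, always_negative)
print(metrics)
```

The accuracy comes out at 99% while recall and F1 are both zero, which is the pattern to look for when a vendor reports accuracy alone.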
An agency is building a churn prediction model for a streaming service client with a monthly churn rate of 3.2%. The training dataset has 180,000 subscriber-months: 5,760 churns (3.2%) and 174,240 retentions (96.8%). The team trains an initial logistic regression without any imbalance handling and evaluates it on a held-out test set. The model achieves 96.7% accuracy but a recall of only 8% for the churn class: it correctly identifies only 8% of actual churns while missing 92%. The high accuracy is driven entirely by the model predicting retention for almost every subscriber.
The team then applies class weighting to the logistic regression, assigning a weight of 30 to each churn example and a weight of 1 to each retention example, reflecting the roughly 1:30 class imbalance. The weighted model achieves 82% recall on the churn class with 71% precision at the default 0.5 decision threshold. Lowering the decision threshold to 0.3 (predicting churn when the predicted probability exceeds 0.3) increases recall to 91% while precision drops to 58%. The team determines that this is the right tradeoff for the client’s retention program, because the cost of a missed churn is higher than the cost of unnecessarily contacting a subscriber who was likely to stay.
With class weighting and threshold adjustment, the deployed model enables the retention team to flag 91% of churns in advance, providing a commercially useful tool from the same data that produced a useless model without imbalance handling.
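The arithmetic behind this example can be reproduced from the figures given. The sketch below derives the class-weight ratio from the class counts and translates each recall and precision pair into contact volumes per 1,000 actual churners. The per-1,000 framing and the helper `contacts_per_1000_churners` are assumptions introduced for illustration, since the example does not state the size of the test set.

```python
n_churn = 5_760
n_retain = 174_240
weight_ratio = n_retain / n_churn  # retentions per churn in the training data

def contacts_per_1000_churners(recall, precision):
    """Per 1,000 actual churners: churners caught, subscribers flagged,
    and flagged subscribers who would not have churned."""
    caught = 1000 * recall
    flagged = caught / precision
    return round(caught), round(flagged), round(flagged - caught)

at_050 = contacts_per_1000_churners(recall=0.82, precision=0.71)
at_030 = contacts_per_1000_churners(recall=0.91, precision=0.58)

print(f"class weight ratio: {weight_ratio:.2f}")
print("threshold 0.5:", at_050)
print("threshold 0.3:", at_030)
```

On these figures, lowering the threshold from 0.5 to 0.3 catches about 90 more churners per 1,000 at the cost of roughly 324 additional unnecessary contacts. Whether that trade is worthwhile depends on the relative cost of a missed churn and a wasted contact.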
The generative AI foundations module covers class imbalance handling techniques including oversampling, class weighting, and appropriate evaluation metrics for rare-event prediction in churn, conversion, and fraud detection applications.