AI Glossary · Letter R

ROC Curve.

A graphical tool for evaluating binary classification model performance that plots the true positive rate against the false positive rate across all possible decision thresholds, and whose area under the curve (AUC) summarizes the model’s overall discrimination ability as a single number between 0 and 1. AUC is the most commonly used threshold-independent metric for comparing classifier performance in propensity modeling, audience scoring, fraud detection, and other binary prediction tasks in marketing.

Also known as receiver operating characteristic, ROC-AUC, AUC curve

What it is

A working definition of the ROC curve.

A binary classifier outputs a score for each input, and the decision threshold determines the score above which an input is classified as positive. As the threshold is raised, fewer inputs are classified as positive: true positive rate (sensitivity, the fraction of actual positives correctly identified) decreases, and false positive rate (1 minus specificity, the fraction of actual negatives incorrectly classified as positive) also decreases. As the threshold is lowered, both rates increase. The ROC curve plots this tradeoff for every possible threshold, with true positive rate on the y-axis and false positive rate on the x-axis, tracing a curve from the bottom-left corner (threshold so high that nothing is classified positive) to the top-right corner (threshold so low that everything is classified positive).
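The threshold sweep described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `roc_points` and the six labeled scores are made up for the example, and an input is treated as positive when its score is at or above the threshold so that each distinct score can serve as a threshold.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs from the strictest threshold to the loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    # Threshold above every score: nothing is classified positive (bottom-left).
    points = [(0.0, 0.0)]

    # Lower the threshold one distinct score at a time.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical data: 1 = actual positive, 0 = actual negative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The first point is the bottom-left corner and the last is the top-right corner, matching the two extremes of the threshold described above.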

The area under the ROC curve (AUC) provides a single-number summary of the classifier’s ability to discriminate between positive and negative examples across all thresholds. An AUC of 1.0 indicates perfect discrimination: the classifier assigns higher scores to all positive examples than to all negative examples. An AUC of 0.5 indicates no discrimination: the classifier performs no better than random chance. AUC can be interpreted probabilistically: it is the probability that the classifier assigns a higher score to a randomly chosen positive example than to a randomly chosen negative example. AUC is scale-invariant (it measures ranking quality, not calibration) and threshold-invariant (it evaluates performance across all thresholds, not at any specific threshold), making it appropriate for comparing models when the deployment threshold has not yet been determined.
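The probabilistic reading of AUC can be checked directly by comparing every positive example with every negative example. The sketch below assumes a small made-up dataset and a hypothetical helper name, `pairwise_auc`; ties are given half credit, which is the usual convention. It also rescales the scores with a strictly increasing transform to illustrate that AUC depends only on the ranking.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = actual positive, 0 = actual negative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = pairwise_auc(labels, scores)

# Cubing the scores changes their scale but not their order.
rescaled_auc = pairwise_auc(labels, [s ** 3 for s in scores])

print(f"AUC = {auc:.3f}, AUC after rescaling = {rescaled_auc:.3f}")
```

In this toy data, eight of the nine positive-negative pairs are ordered correctly, so both values come out to roughly 0.889.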

AUC has important limitations. Because it weights all thresholds equally, a model can earn a high AUC while performing poorly at the specific threshold that will be used in deployment. On highly imbalanced datasets where the positive class is rare, as in fraud detection where fraud rates may be 0.1%, AUC can look strong even when most of the cases the model flags are false positives: the false positive rate is computed over the very large negative class, so a small false positive rate can still correspond to far more false positives than true positives. Precision-recall curves and their area (PR-AUC) are better suited to imbalanced class problems because precision compares true positives directly with false positives, so they focus on the model’s performance on the positive class rather than giving credit for the many correctly rejected negatives.
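A short calculation shows why the same point on a ROC curve can mean very different things at different base rates. The sketch below is illustrative only: the operating point (80% true positive rate, 5% false positive rate) is an assumed value, the 0.1% base rate is the fraud figure mentioned above, and `precision_from_rates` is a hypothetical helper name.

```python
def precision_from_rates(tpr, fpr, base_rate):
    """Precision implied by a ROC operating point and the share of positives."""
    true_pos = tpr * base_rate            # share of all cases that are true positives
    false_pos = fpr * (1.0 - base_rate)   # share of all cases that are false positives
    if true_pos + false_pos == 0:
        raise ValueError("nothing is classified positive at this operating point")
    return true_pos / (true_pos + false_pos)


# Same assumed operating point, two different class balances.
balanced = precision_from_rates(tpr=0.80, fpr=0.05, base_rate=0.50)
rare = precision_from_rates(tpr=0.80, fpr=0.05, base_rate=0.001)

print(f"precision at 50% base rate:  {balanced:.3f}")
print(f"precision at 0.1% base rate: {rare:.3f}")
```

With balanced classes, about 94% of flagged cases are true positives; at a 0.1% base rate, the same operating point gives precision under 2%, even though the ROC curve is identical.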

Why ad agencies care

Why AUC is the standard metric for propensity and audience scoring models, and what its limitations require agencies to understand.

A working ad agency building or evaluating propensity models, churn predictors, lead scoring systems, or audience quality classifiers for clients will encounter AUC as the primary reported model performance metric. AUC is appropriate for these use cases because the deployment threshold is typically not fixed at training time: a lead scoring model might route the top 15% of leads to the outbound sales team for one client and the top 30% for another, depending on sales capacity. AUC measures the model’s discrimination quality across all possible threshold choices, making it a more useful summary statistic than accuracy at a specific threshold when the deployment threshold is variable or not yet determined.

Gains in AUC translate into more efficient audience scoring, and through it into campaign ROI. For a churn prediction model, moving from AUC 0.65 to 0.75 means that a campaign targeting the top 20% of scored customers for a retention intervention will capture substantially more of the actual churners within that 20%. As an illustration, at AUC 0.65 the top 20% of the scored audience might contain 38% of actual churners; at AUC 0.75, it might contain 52%. This higher concentration of genuine churners in the intervention audience directly determines the intervention’s cost-effectiveness, because fewer contacts are wasted on customers who would not have churned and more genuine churners are reached within the same budget. Quantifying the business value of an AUC improvement requires this translation from model metric to audience composition to cost-per-intervention, which is the analysis agencies should present when justifying model investment to clients.
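The share of actual churners captured in the top slice of a scored audience can be computed directly from labeled scores. The following is a minimal sketch with made-up data for ten customers; `capture_rate` is a hypothetical helper name, and rounding the audience size to the nearest whole customer is an assumption of the example.

```python
def capture_rate(labels, scores, top_fraction):
    """Share of all actual positives that fall in the top-scored fraction."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive example")

    # Rank customers from highest to lowest score and keep the top slice.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(top_fraction * len(ranked)))
    captured = sum(label for _, label in ranked[:k])
    return captured / total_positives


# Hypothetical data: ten customers, three of whom actually churned (label 1).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

top_20 = capture_rate(labels, scores, top_fraction=0.20)
print(f"share of churners captured in the top 20%: {top_20:.2f}")
```

Running the same function on two candidate models’ scores for the same customers gives the kind of side-by-side comparison (for example, 38% versus 52% captured) that the paragraph above describes.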

Reporting AUC alone for imbalanced audience scoring problems can mislead clients about how well a model detects rare conversion events. A purchase propensity model for a product with a 2% base conversion rate can achieve an AUC of 0.78 while still delivering low precision, or missing the majority of actual converters, at any practically deployable threshold. AUC measures only how well converters are ranked above non-converters, and with 49 non-converters for every converter, even a small false positive rate puts many non-converters into the targeted audience. Reporting precision and recall at the actual deployment threshold, alongside AUC, provides a complete and honest characterization of model performance that includes the tradeoff between precision (fraction of predicted converters who actually convert) and recall (fraction of actual converters captured). Precision-recall analysis at the deployment threshold is what clients actually need to understand the economics of running a campaign based on the model’s scores.
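Precision and recall at a chosen deployment threshold are simple counts over the scored audience. The sketch below uses made-up scores and an assumed threshold of 0.85; the helper name `precision_recall_at_threshold` is hypothetical, and a user is treated as a predicted converter when the score is at or above the threshold.

```python
def precision_recall_at_threshold(labels, scores, threshold):
    """Precision and recall when everyone scoring >= threshold is targeted."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)

    # Report 0.0 rather than dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: ten scored users, three actual converters (label 1).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall_at_threshold(labels, scores, threshold=0.85)
print(f"precision={precision:.2f}  recall={recall:.2f}")
```

At this threshold, half of the targeted users are converters, but only one of the three actual converters is reached, which is the kind of tradeoff a single AUC figure does not reveal.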

Comparing ROC curves directly reveals whether one model is superior at every threshold, which a single AUC number cannot show. When comparing two candidate models with similar AUC values, plotting their ROC curves together shows whether one model dominates the other across the full threshold range or whether the curves cross: one model may be better in the high-precision region (the lower-left portion of the ROC curve, at low false positive rates, relevant for campaigns targeting a small high-value audience) while the other is better in the high-recall region (toward the upper right). This visual comparison prevents the mistake of selecting a model based on aggregate AUC when a competitor model is substantially better in the specific operating region that matters for the deployment use case.
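The same comparison can be made numerically by reading each curve at the false positive rates that matter for the campaign. The sketch below uses two entirely hypothetical ROC curves, each given as a short list of (false positive rate, true positive rate) points; `tpr_at_fpr` and `trapezoid_auc` are illustrative helper names, and linear interpolation between points is an assumption of the example.

```python
def tpr_at_fpr(roc, target_fpr):
    """Linearly interpolate TPR at target_fpr from ROC points sorted by FPR."""
    for (x0, y0), (x1, y1) in zip(roc, roc[1:]):
        if x0 <= target_fpr <= x1:
            if x1 == x0:
                return max(y0, y1)  # vertical segment: take the higher point
            return y0 + (y1 - y0) * (target_fpr - x0) / (x1 - x0)
    raise ValueError("target_fpr is outside the curve's range")


def trapezoid_auc(roc):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(roc, roc[1:]))


# Hypothetical curves with similar AUC that cross each other.
model_a = [(0.0, 0.0), (0.1, 0.6), (0.5, 0.8), (1.0, 1.0)]
model_b = [(0.0, 0.0), (0.1, 0.3), (0.5, 0.95), (1.0, 1.0)]

for name, roc in (("A", model_a), ("B", model_b)):
    print(
        f"model {name}: AUC={trapezoid_auc(roc):.3f}  "
        f"TPR@FPR=0.05: {tpr_at_fpr(roc, 0.05):.2f}  "
        f"TPR@FPR=0.50: {tpr_at_fpr(roc, 0.50):.2f}"
    )
```

In this made-up case the two AUC values are within 0.01 of each other, yet model A finds twice as many positives at a 5% false positive rate while model B is ahead at a 50% false positive rate.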

In practice

What a ROC curve looks like inside a working ad agency.

An agency is building a purchase propensity model for a subscription software client to identify which free trial users are most likely to convert to paid subscriptions. The training set contains 14,200 trial users from the prior 12 months, of whom 1,890 (13.3%) converted to paid. The agency trains three candidate models: logistic regression, gradient boosted trees, and a random forest. ROC-AUC on a held-out test set is 0.74 for logistic regression, 0.83 for gradient boosted trees, and 0.81 for the random forest, so the gradient boosted trees model leads on AUC.

AUC alone does not settle the choice, however, because the client’s sales team capacity limits outreach to the top 500 scored trial users (approximately 3.5% of a pool the size of the 14,200-user training set). The agency therefore evaluates the three models not only on overall AUC but also on precision among the top 500 scored users: 0.42 for logistic regression, 0.56 for gradient boosted trees, and 0.52 for the random forest. The gradient boosted trees model also leads on precision at the operating threshold, confirming the selection.

At 0.56 precision and 500 outreach contacts, an expected 280 of the users contacted each month are converters. Without the model, the sales team would contact 500 randomly selected trial users per month and reach approximately 67 converters (13.3% base rate times 500). The model therefore produces a 4.2x improvement in sales contact efficiency (280 converters reached versus 67 without scoring), which means about 4.2 times as many conversions per outreach hour from the sales team. The AUC metric correctly identified the best model; the precision-at-threshold analysis quantified the business value of the model choice in terms the client’s sales leadership could act on.
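The arithmetic behind the efficiency comparison can be reproduced in a few lines. The figures below are taken from the example above; the variable names are illustrative, and the calculation assumes that precision among the top 500 measured on the held-out test set carries over to live outreach.

```python
contacts = 500
base_rate = 1890 / 14200  # share of trial users who converted, about 13.3%

# Precision among the top 500 scored users, as reported for each candidate model.
precision_at_500 = {
    "logistic regression": 0.42,
    "gradient boosted trees": 0.56,
    "random forest": 0.52,
}

# Expected converters reached when contacting 500 randomly chosen trial users.
baseline_converters = base_rate * contacts

lift = {}
for model, precision in precision_at_500.items():
    expected_converters = precision * contacts
    lift[model] = expected_converters / baseline_converters
    print(
        f"{model}: about {expected_converters:.0f} converters, "
        f"{lift[model]:.1f}x the random baseline of about {baseline_converters:.0f}"
    )
```

Because the number of contacts is the same in both cases, the lift is simply precision divided by the base rate, which is why precision at the operating threshold is the figure to put in front of the sales team.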
