A quantitative measure used to evaluate how well a machine learning model performs on its task. Performance metrics translate model predictions into numbers that allow practitioners to compare models, track improvements during development, set deployment thresholds, and monitor production quality. Choosing the right metric is as important as choosing the right model, because optimizing a model toward a misaligned metric produces a model that scores well but fails at the actual business objective.
Also known as evaluation metric, model metric, success metric
Every machine learning task has a set of performance metrics appropriate to its structure and goals. Classification metrics include accuracy, precision, recall, F1 score, and AUC-ROC. Regression metrics include mean absolute error, root mean squared error, and R-squared. Ranking metrics include NDCG (Normalized Discounted Cumulative Gain) and mean average precision. Language generation metrics include BLEU, ROUGE, and human preference ratings. Each metric captures a different aspect of model performance and reflects different tradeoffs: accuracy is simple but misleading for imbalanced classes; precision and recall capture the tradeoff between false positives and false negatives; AUC is threshold-independent; RMSE penalizes large errors more heavily than MAE.
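The claim that RMSE penalizes large errors more heavily than MAE can be checked with a small sketch. The forecasts below are hypothetical values chosen for illustration, not data from any real model: both sets of predictions have the same total absolute error, but one spreads it evenly and the other concentrates it in a single large miss.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: errors are squared before averaging."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

# Hypothetical actual values and two sets of predictions.
y_true = [10.0, 10.0, 10.0, 10.0]
even_errors = [12.0, 8.0, 12.0, 8.0]      # four errors of size 2
one_big_error = [10.0, 10.0, 10.0, 18.0]  # one error of size 8

mae_even = mae(y_true, even_errors)
rmse_even = rmse(y_true, even_errors)
mae_big = mae(y_true, one_big_error)
rmse_big = rmse(y_true, one_big_error)

print(mae_even, rmse_even)  # same MAE and RMSE when errors are uniform
print(mae_big, rmse_big)    # same MAE, but RMSE doubles
```

Under these assumed numbers MAE is 2.0 in both cases, while RMSE rises from 2.0 to 4.0 when the error is concentrated in one prediction. Which behavior is preferable depends on whether a single large miss is more costly to the business than several small ones.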
The gap between a model’s offline evaluation metric and its real-world business impact is one of the most persistent challenges in applied machine learning. A model that achieves high AUC on a held-out test set may still fail to produce business value because the test set does not reflect the deployment distribution, because the metric does not capture the true cost of different error types, or because the metric is itself a proxy that diverges from the actual goal. Connecting offline evaluation metrics to expected business outcomes requires both careful metric selection and a process for validating that improvements in the offline metric translate to improvements in the deployed system.
Multiple metrics are typically needed to fully characterize a model’s performance profile, because different metrics reveal different aspects of model behavior. A conversion prediction model with high AUC but poor calibration will correctly rank users by conversion likelihood but produce probability estimates that are systematically too high or too low, making the probabilities unreliable for downstream budget allocation calculations. A creative safety classifier with high precision but low recall will avoid false positives (incorrectly flagging safe creative) but miss many unsafe creatives. Reporting a single metric without the full performance profile gives an incomplete picture of model behavior.
A working ad agency evaluating AI vendor claims, commissioning custom model development, or monitoring deployed AI systems needs to be fluent in performance metrics to avoid being misled by selectively reported numbers. A vendor who reports only training accuracy on a class-imbalanced dataset, only AUC without calibration metrics on a propensity model, or only BLEU scores for a copy generation tool is presenting incomplete information that may conceal significant performance gaps. Knowing which metrics to request for each model type, and why each metric matters, is the baseline analytical competency for responsible AI procurement.
Precision and recall must both be reported for any classifier applied to rare marketing events. A brand safety classifier evaluated only on accuracy, on a dataset with 95% safe content, can achieve 95% accuracy by flagging nothing as unsafe, which is useless. Precision measures the fraction of flagged items that are genuinely unsafe; recall measures the fraction of all unsafe items that are correctly flagged. A classifier with 90% precision and 40% recall is right about most of what it flags but misses 60% of all unsafe content. A classifier with 40% precision and 90% recall catches most unsafe content, but 60% of the items it flags are actually safe. Neither precision nor recall alone captures the full story; both must be reported, and the tradeoff between them must be set according to the relative cost of false positives versus false negatives in the specific use case.
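The arithmetic behind these three classifiers can be sketched from confusion-matrix counts. The dataset size and the counts below are hypothetical, chosen only so that 5% of items are unsafe and the precision and recall figures come out to the round numbers used in the text; `classifier_metrics` is an illustrative helper, not a library function.

```python
def classifier_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, and recall from confusion-matrix counts.

    tp: unsafe items correctly flagged    fp: safe items wrongly flagged
    fn: unsafe items missed               tn: safe items correctly passed
    """
    total = tp + fp + fn + tn
    flagged = tp + fp
    actual_unsafe = tp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Precision is undefined when nothing is flagged.
        "precision": tp / flagged if flagged else None,
        "recall": tp / actual_unsafe if actual_unsafe else None,
    }

# Hypothetical dataset: 3,600 items, 5% unsafe (180 unsafe, 3,420 safe).
flag_nothing = classifier_metrics(tp=0, fp=0, fn=180, tn=3420)
high_precision = classifier_metrics(tp=72, fp=8, fn=108, tn=3412)
high_recall = classifier_metrics(tp=162, fp=243, fn=18, tn=3177)

for name, m in [("flag nothing", flag_nothing),
                ("high precision", high_precision),
                ("high recall", high_recall)]:
    print(name, m)
```

With these assumed counts, the classifier that flags nothing scores 95% accuracy with zero recall, and it even beats the high-recall classifier on accuracy (about 92.8%) despite catching no unsafe content at all. That is why accuracy alone should not be accepted for rare-event classifiers.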
Calibration should be evaluated alongside discrimination for any model whose probability outputs inform budget decisions. A media mix attribution model that predicts the probability that each touchpoint contributed to a conversion may discriminate well between high-attribution and low-attribution touchpoints (high AUC) while being systematically miscalibrated, meaning its predicted probabilities are consistently higher or lower than the actual fraction of touchpoints that contribute. If these miscalibrated probabilities are used to weight touchpoints in budget allocation, the resulting allocations will be distorted by the calibration error. Calibration plots and Brier scores should be required components of the performance reporting for any model whose numeric outputs feed downstream financial decisions.
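The difference between discrimination and calibration can be illustrated with a minimal sketch in plain Python. The data is synthetic and constructed to be perfectly calibrated; the "inflated" scores apply a square root to every probability, an arbitrary order-preserving distortion assumed here purely for illustration. The `auc` and `brier_score` helpers are simple reference implementations, not a vendor's or library's method.

```python
import math

def auc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half). Depends only on the ordering of the scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, probs):
    """Mean squared gap between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

# Synthetic, perfectly calibrated data: in each group of 10 items,
# the share of positives equals the predicted probability.
y_true, calibrated = [], []
for prob, positives in [(0.1, 1), (0.3, 3), (0.5, 5), (0.7, 7), (0.9, 9)]:
    y_true += [1] * positives + [0] * (10 - positives)
    calibrated += [prob] * 10

# Hypothetical distortion: same ranking, systematically higher probabilities.
inflated = [math.sqrt(p) for p in calibrated]

auc_calibrated = auc(y_true, calibrated)
auc_inflated = auc(y_true, inflated)
brier_calibrated = brier_score(y_true, calibrated)
brier_inflated = brier_score(y_true, inflated)

print(auc_calibrated, auc_inflated)      # identical: ranking is unchanged
print(brier_calibrated, brier_inflated)  # the Brier score worsens
```

Because AUC depends only on how items are ranked, the inflated scores report exactly the same AUC, while the Brier score gets worse. A report containing only AUC would not distinguish these two sets of outputs, even though only one of them is safe to use as a probability in a budget calculation.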
A/B test metrics must be pre-registered to prevent metric selection bias from inflating apparent results. When evaluating an AI tool against a baseline through an A/B test, the primary success metric must be specified before the test begins and must not be changed after seeing results. Post-hoc selection of the metric on which the treatment looks best is a form of p-hacking that produces false-positive results. Agencies should require pre-registered primary metrics from vendors who conduct their own evaluation studies, and should pre-register their own primary metrics before running internal evaluations, so that the results remain credible.
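A rough calculation shows why post-hoc metric selection inflates results. The sketch below assumes, as a simplification, that the tool has no real effect, that each metric is tested at the same significance level, and that the metrics are statistically independent; real campaign metrics are usually correlated, so the figures are indicative only. The function name is illustrative.

```python
def chance_of_false_win(alpha, num_metrics):
    """Probability that at least one of `num_metrics` independent metrics
    looks significant at level `alpha` when there is no true effect."""
    return 1 - (1 - alpha) ** num_metrics

alpha = 0.05
for k in (1, 5, 10, 20):
    print(f"{k:>2} metrics examined -> "
          f"{chance_of_false_win(alpha, k):.1%} chance of a spurious win")
```

Under these assumptions, a single pre-registered metric carries the nominal 5% false-positive risk, while choosing the best of ten metrics after the fact raises it to roughly 40%.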
An agency is evaluating two audience propensity scoring vendors for a subscription media client. Both vendors provide model performance reports. Vendor A reports AUC of 0.87 on a held-out test set. Vendor B reports AUC of 0.83 with additional metrics: precision of 0.71 and recall of 0.68 at the recommended score threshold of 0.5, calibration error of 0.04 (meaning predicted probabilities are, on average, within 4 percentage points of observed conversion rates), and a lift chart showing 4.2x lift in the top decile of scored users versus the full population. The agency’s analytics lead notes that Vendor A’s higher AUC does not necessarily make it the better choice for the client’s use case. The client intends to use the propensity scores in two ways: ranking users for inclusion in the high-value retargeting segment (for which discrimination matters) and setting bid multipliers proportional to conversion probability (for which calibration matters). Vendor A cannot provide calibration metrics, reporting that their system does not output calibrated probabilities but only relative score ranks. Vendor B’s calibration error of 0.04 indicates that its probability outputs can be used directly in bid multiplier calculations. The agency selects Vendor B despite its lower AUC because the calibrated probability outputs align with the client’s bidding use case, and asks Vendor A to add calibration to their evaluation framework if they want to compete for similar opportunities in the future. In this case, the completeness of the reported metrics and their fit to the client’s use case drive the vendor selection more than the 0.04 difference in AUC.
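A top-decile lift figure like the one Vendor B reports can be reproduced from scored users and observed outcomes. The sketch below uses a small set of hypothetical scores and conversions, not either vendor's data, and `top_decile_lift` is an illustrative helper that assumes lift is defined as the conversion rate in the top 10% of scores divided by the overall conversion rate.

```python
def top_decile_lift(converted, scores):
    """Conversion rate among the top 10% of users by score,
    divided by the conversion rate across all users."""
    ranked = sorted(zip(scores, converted), key=lambda pair: pair[0], reverse=True)
    k = max(1, len(ranked) // 10)  # size of the top decile
    top_rate = sum(c for _, c in ranked[:k]) / k
    overall_rate = sum(converted) / len(converted)
    return top_rate / overall_rate

# Hypothetical scored audience: 20 users, 4 conversions (20% overall).
scores = [0.95, 0.90, 0.82, 0.78, 0.71, 0.66, 0.60, 0.55, 0.49, 0.44,
          0.40, 0.36, 0.31, 0.27, 0.22, 0.18, 0.14, 0.10, 0.07, 0.03]
converted = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

lift = top_decile_lift(converted, scores)
print(f"top-decile lift: {lift:.1f}x")
```

In this toy data both users in the top decile convert against a 20% overall rate, giving a lift of about 5x. Lift describes how well the scores rank users, so it speaks to the retargeting use case; it says nothing about whether the scores are calibrated enough to drive bid multipliers.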
The generative AI foundations module covers the full landscape of machine learning performance metrics, explains what each metric does and does not measure, and provides the evaluation framework agencies need to assess AI vendor claims and commission high-quality model development.