Evaluation Metrics.

The quantitative measures used to assess how well a machine learning model performs on a specific task, including accuracy, precision, recall, F1 score, AUC, and mean absolute error. For agencies, the choice of evaluation metric is not a technical afterthought: it shapes what the model is tuned and selected for, which errors it tolerates, and whether the resulting system actually serves the business objective it was built for.

Also known as model metrics, performance metrics, model evaluation criteria

What it is

A working definition of evaluation metrics.

Each evaluation metric translates model behavior into a single number that summarizes one aspect of how well the model accomplishes its task. Different metrics measure different properties of model performance, and each has a specific mathematical definition that captures a particular type of error or correctness. Accuracy measures the fraction of all predictions that are correct. Precision measures, among all positive predictions, how many were actually positive. Recall measures, among all actual positives, how many the model correctly identified. These three metrics can tell very different stories about the same model depending on the distribution of the data and the relative costs of different error types.
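The three definitions above can be written directly in terms of confusion-matrix counts. The following is a minimal sketch, not a production implementation; the function name and the counts are hypothetical and chosen only to show how the same model can look strong on accuracy and weak on recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when the model makes no positive
    # predictions or when the data contains no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts: 1,000 predictions, of which 100 are actual positives.
accuracy, precision, recall = classification_metrics(tp=40, fp=10, fn=60, tn=890)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.93 precision=0.80 recall=0.40
```

With these illustrative counts, the model is right 93% of the time overall and right 80% of the time when it predicts a positive, yet it finds only 40% of the actual positives.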

The distinction between precision and recall is critical in practice. A lead scoring model with high precision makes fewer false predictions of high-value leads, meaning the sales team wastes less time on unqualified contacts. A model with high recall surfaces more of the actually qualified leads, meaning fewer real opportunities are missed. In most real deployments, improving one comes at the cost of the other, and the right tradeoff depends on whether missed opportunities or wasted sales effort costs more for that specific client.
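One way to see the tradeoff is to move the score threshold above which a lead is passed to sales. The sketch below assumes a model that outputs a score between 0 and 1 for each lead; the scores, labels, and function name are hypothetical illustration data, not results from any real model.

```python
def precision_recall_at(threshold, scored_leads):
    """Precision and recall when leads scoring >= threshold are flagged as qualified."""
    tp = sum(1 for score, qualified in scored_leads if score >= threshold and qualified)
    fp = sum(1 for score, qualified in scored_leads if score >= threshold and not qualified)
    fn = sum(1 for score, qualified in scored_leads if score < threshold and qualified)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical (model score, actually qualified?) pairs for ten leads.
leads = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False),
    (0.50, True), (0.40, False), (0.30, False), (0.20, True), (0.10, False),
]

for threshold in (0.85, 0.45, 0.15):
    p, r = precision_recall_at(threshold, leads)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, a strict threshold of 0.85 gives perfect precision but finds only two of the five qualified leads, while a loose threshold of 0.15 finds all five at the cost of four unqualified contacts reaching the sales team.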

AUC, the area under the receiver operating characteristic curve, measures overall discriminative ability across all possible classification thresholds and is useful for comparing models on imbalanced datasets where accuracy is misleading. For regression tasks, mean absolute error (MAE) and root mean squared error (RMSE) both summarize the typical size of prediction errors, with RMSE penalizing large errors more heavily than MAE because it squares the differences before averaging and taking the square root. The choice between these depends on whether large individual errors are disproportionately costly in the target application.
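The difference between the two regression metrics shows up when the same total error is spread evenly versus concentrated in one prediction. This is a small sketch with hypothetical numbers; the variable names are illustrative only.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square the differences, average, then take the root."""
    mean_squared = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_squared)


# Hypothetical forecasts against the same four actual values.
actual = [100, 100, 100, 100]
evenly_off = [90, 110, 90, 110]      # every forecast is off by 10
one_big_miss = [100, 100, 100, 140]  # three exact forecasts, one off by 40

print(mae(actual, evenly_off), rmse(actual, evenly_off))      # prints: 10.0 10.0
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))  # prints: 10.0 20.0
```

Both sets of forecasts have the same MAE, but RMSE doubles for the set with a single large miss, which is the behavior to want when one large error costs more than several small ones.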

Why ad agencies care

Why evaluation metrics matter more in agency work than in most industries.

Every AI tool a working ad agency uses was optimized for something, and whatever metric was used during training is what the model learned to be good at. If that metric does not align with the business objective, the model can score well on the vendor benchmark while failing at the client’s actual goal. Understanding evaluation metrics is how agencies move from accepting vendor performance claims at face value to evaluating whether the model is optimized for the right thing.

Metric misalignment is a common source of AI tool disappointment. A brand safety classifier selected for its accuracy on a balanced benchmark dataset can perform poorly in production, where unsafe content represents 2% of inputs. At that class balance, a model that labels everything safe scores 98% accuracy while catching nothing, so accuracy says little about the class that matters. The more informative metrics for this use case are recall and precision on the unsafe class: recall shows how much unsafe content is caught, and precision shows how often a flag is correct. Agencies that specify required metrics in vendor evaluations rather than accepting generic accuracy claims avoid this failure mode systematically.
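The arithmetic behind the 2% case can be checked with a few lines. This sketch assumes a hypothetical sample of 1,000 items and a deliberately useless classifier that labels everything safe; it is an illustration of why accuracy misleads at this class balance, not a model of any real vendor tool.

```python
# Hypothetical production sample: 2% unsafe (label 1), 98% safe (label 0).
labels = [1] * 20 + [0] * 980

# A degenerate classifier that always predicts the majority class ("safe").
predictions = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Performance on the unsafe class specifically.
flagged = sum(predictions)  # how many items were flagged unsafe at all
caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
unsafe_recall = caught / sum(labels)

print(f"accuracy={accuracy:.2f} flagged={flagged} unsafe_recall={unsafe_recall:.2f}")
# prints: accuracy=0.98 flagged=0 unsafe_recall=0.00
```

A 98% accuracy figure here describes a tool that never flags anything, which is why per-class metrics on the unsafe class belong in the vendor evaluation.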

Different clients have different error cost asymmetries. A client in financial services who uses a fraud detection model has a very different tolerance for false positives and false negatives than a client using a content recommendation model. The metric that governs the model should reflect those cost asymmetries explicitly. Agencies building custom models for clients should document the metric choice and the business rationale for it, so the reasoning is visible when the tradeoffs are revisited during performance reviews.

Composite metrics can hide individual metric failures. F1 score, which combines precision and recall into a single number, can remain stable even as precision drops and recall rises to compensate. Reporting only composite metrics to clients obscures the directional shifts in the underlying components that often signal emerging problems. Reporting precision and recall separately, in addition to F1, provides the early warning that the composite metric would mask.
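A short calculation shows how this masking happens. F1 is the harmonic mean of precision and recall, so swapping the two values leaves it unchanged. The two reporting periods below are hypothetical figures chosen to make the point.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical reporting periods for the same model.
last_quarter = {"precision": 0.80, "recall": 0.60}
this_quarter = {"precision": 0.60, "recall": 0.80}

f1_before = f1_score(**last_quarter)
f1_after = f1_score(**this_quarter)

print(f"F1 before={f1_before:.3f} after={f1_after:.3f}")
# prints: F1 before=0.686 after=0.686
```

A report showing only F1 would read as "no change", while precision has fallen by 20 points, meaning a quarter of the correct-flag rate has been lost.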

In practice

What evaluation metrics look like inside a working ad agency.

An agency is evaluating two AI tools for a retail client’s product recommendation engine. Vendor A reports 91% accuracy. Vendor B reports 78% accuracy. The agency asks both vendors to provide precision, recall, and AUC breakdowns on a held-out test set derived from the client’s own historical purchase data rather than the vendor’s standard benchmark. On the client data, Vendor A’s accuracy advantage proves hollow: its high accuracy reflects correct predictions on the 85% of cases where no purchase occurred, not on the relevant cases where a purchase was made. At separating purchasers from non-purchasers, Vendor B’s AUC is 0.84 versus Vendor A’s 0.71. The agency selects Vendor B and documents the metric analysis in the vendor selection rationale delivered to the client.
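For an agency that wants to reproduce an AUC figure from a vendor's scores on its own held-out data, AUC can be read as the probability that a randomly chosen purchase case is scored above a randomly chosen non-purchase case. The sketch below uses that pairwise definition; the function name, labels, and scores are hypothetical and are not the vendors' data from the scenario.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out cases: 1 = purchase, 0 = no purchase.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
ranking_model = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]
constant_model = [0.5] * len(labels)  # gives every case the same score

print(f"ranking model AUC={roc_auc(labels, ranking_model):.2f}")
print(f"constant model AUC={roc_auc(labels, constant_model):.2f}")
```

The constant model lands at 0.5, the level of random ranking, regardless of how accurate its majority-class predictions look. The pairwise loop is quadratic in the number of cases, so it suits spot checks on small samples rather than large test sets.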

Build the measurement fluency that evaluates AI tools on what actually matters through The Creative Cadence Workshop.

The generative AI foundations module of the workshop covers how to evaluate model performance honestly, including the metric choices that align model optimization with client business objectives rather than vendor benchmark scores.