A classification model evaluation metric that measures the proportion of actual negative examples correctly identified as negative, calculated as true negatives divided by the sum of true negatives and false positives. Specificity captures the model’s ability to avoid false alarms: a high-specificity model rarely labels negative examples as positive, making it appropriate for use cases where false positives are costly, such as brand safety classification or fraud flagging that triggers expensive human review.
Also known as true negative rate, selectivity, TNR
Specificity = true negatives / (true negatives + false positives). It measures the fraction of all actual negative examples that the model correctly classifies as negative. A specificity of 0.95 means the model correctly identifies 95% of negative examples as negative, incorrectly classifying only 5% of negatives as positive (false positives). Specificity is the complement of the false positive rate: specificity = 1 minus false positive rate. High-specificity models are accurate at identifying negatives: they are selective about what they call positive, so they label few actual negatives as positive and produce few false alarms.
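The formula and its relationship to the false positive rate can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names and the counts (950 true negatives, 50 false positives) are hypothetical and chosen only to reproduce the 0.95 case described above.

```python
def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of actual negatives that were classified as negative."""
    actual_negatives = true_negatives + false_positives
    if actual_negatives == 0:
        raise ValueError("specificity is undefined when there are no actual negatives")
    return true_negatives / actual_negatives


def false_positive_rate(true_negatives: int, false_positives: int) -> float:
    """Complement of specificity: fraction of actual negatives flagged as positive."""
    return 1.0 - specificity(true_negatives, false_positives)


# Hypothetical counts: 1,000 actual negatives, of which 950 were classified correctly.
spec = specificity(true_negatives=950, false_positives=50)
fpr = false_positive_rate(true_negatives=950, false_positives=50)
print(f"specificity = {spec:.2f}, false positive rate = {fpr:.2f}")
```

Note that true positives and false negatives do not appear anywhere in the calculation: specificity depends only on how the actual negatives were handled.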
Specificity trades off against sensitivity (recall, true positive rate): as a classifier’s threshold is raised, it requires stronger evidence to predict positive, which decreases sensitivity (fewer true positives are captured) but increases specificity (fewer false positives are produced). This tradeoff is captured by the ROC curve, which plots sensitivity on the y-axis and 1-minus-specificity on the x-axis across all thresholds. Points in the upper-left of the ROC space correspond to classifiers with both high sensitivity and high specificity, which is the ideal but is limited by the model’s inherent discriminative ability. The appropriate tradeoff between sensitivity and specificity depends on the relative costs of false negatives and false positives in the specific application.
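The threshold tradeoff can be illustrated with a small sweep over a toy set of scored examples. The scores, labels, and thresholds below are invented for illustration and do not come from any real classifier; the sketch assumes that a score at or above the threshold is predicted positive.

```python
# Hypothetical (score, label) pairs: label 1 = actual positive, 0 = actual negative.
scored_examples = [
    (0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
    (0.50, 1), (0.40, 0), (0.30, 0), (0.20, 0), (0.10, 0),
]


def rates_at(threshold, examples):
    """Return (sensitivity, specificity) when score >= threshold predicts positive."""
    tp = sum(1 for score, label in examples if label == 1 and score >= threshold)
    fn = sum(1 for score, label in examples if label == 1 and score < threshold)
    tn = sum(1 for score, label in examples if label == 0 and score < threshold)
    fp = sum(1 for score, label in examples if label == 0 and score >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Raising the threshold demands stronger evidence before predicting positive.
roc_points = []
for threshold in (0.25, 0.45, 0.65, 0.85):
    sensitivity, specificity = rates_at(threshold, scored_examples)
    roc_points.append((1 - specificity, sensitivity))  # (x, y) point on the ROC curve
    print(f"threshold={threshold:.2f}  sensitivity={sensitivity:.2f}  specificity={specificity:.2f}")
```

Each entry in `roc_points` is one point on the ROC curve. As the threshold rises in this toy data, sensitivity never increases and specificity never decreases, which is the tradeoff described above.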
In the context of binary classification for marketing applications, specificity determines the false alarm rate for the negative class. For a brand safety classifier, the negative class is “safe content” and a false positive is safe content incorrectly classified as unsafe, triggering unnecessary exclusion. For a spam filter, the negative class is “legitimate email” and a false positive is legitimate email incorrectly filtered as spam. For a fraud detection model, the negative class is “legitimate transaction” and a false positive is a legitimate transaction incorrectly flagged as fraud. In each case, the cost of the false positive (lost impression opportunity, missed legitimate email, declined legitimate purchase) determines how much specificity is required.
A working ad agency evaluating brand safety classifiers, content moderation tools, or any screening system where a false positive causes a measurable cost should use specificity as a primary evaluation metric alongside sensitivity. A brand safety classifier with 98% sensitivity but 70% specificity is blocking 30% of safe content as unsafe, representing a material loss of impression reach on inventory that was excluded incorrectly. This specificity number, not just the sensitivity or overall accuracy, is what determines the real-world cost of using the classifier.
Brand safety classifier specificity directly determines the opportunity cost of false exclusions from programmatic inventory. A brand safety system deployed to exclude unsafe inventory from a programmatic campaign applies a classifier to all available inventory and excludes the portion it labels as unsafe. If the classifier has 85% specificity, 15% of actually safe inventory is excluded as false positives. For a client with $500,000 per month in programmatic spend at an average $5 CPM, the budget corresponds to approximately 100 million impressions per month. Treating that pool as safe inventory for simplicity, a 15% false exclusion rate removes approximately 15 million impressions of safe inventory from the buy, shrinking the supply the campaign can bid on, which tends to inflate the effective CPM and reduce reach. Evaluating brand safety vendors on specificity against a labeled test set is more informative for budget efficiency than evaluating only on sensitivity or overall accuracy.
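The arithmetic in this scenario can be checked with a short calculation. The sketch below uses the spend, CPM, and specificity figures from the paragraph above and makes the same simplifying assumption that the whole impression pool is safe inventory; the variable names are illustrative.

```python
monthly_spend_usd = 500_000
average_cpm_usd = 5.0    # cost per 1,000 impressions
specificity = 0.85       # share of safe inventory the classifier correctly keeps

# CPM is priced per thousand impressions, so impressions = spend / CPM * 1,000.
eligible_impressions = monthly_spend_usd / average_cpm_usd * 1_000

# Simplifying assumption: the whole pool is safe, so every exclusion is a false positive.
falsely_excluded = eligible_impressions * (1 - specificity)
remaining = eligible_impressions - falsely_excluded

print(f"eligible impressions: {eligible_impressions:,.0f}")
print(f"falsely excluded:     {falsely_excluded:,.0f}")
print(f"remaining safe pool:  {remaining:,.0f}")
```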
Content quality screening with high specificity produces pre-screening pipelines that route a small, well-targeted fraction of AI-generated content to human review rather than a large random sample. A quality screening classifier for AI-generated ad copy that achieves 90% specificity and 80% sensitivity will correctly auto-approve 90% of acceptable copy and incorrectly flag 10% of acceptable copy for human review. Combined with 80% sensitivity capturing 80% of actually low-quality copy, the human review queue contains mostly genuinely problematic copy with a manageable share of false alarms, provided low-quality copy is not rare among the items screened. If specificity falls to 70%, the false positive rate triples from 10% to 30%, tripling the volume of acceptable copy sent to human review without increasing the number of genuine quality issues caught. Specificity, not just sensitivity, determines whether automated screening pipelines are efficient enough to justify their development cost.
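How the review queue changes as specificity falls from 90% to 70% can be worked through numerically. The sketch below assumes a hypothetical batch of 10,000 pieces of copy of which 20% are low quality; neither figure comes from the text above, and the composition of the queue shifts if the real share of low-quality copy differs.

```python
def review_queue(total_items, low_quality_share, sensitivity, specificity):
    """Return (true_flags, false_flags) routed to human review."""
    low_quality = total_items * low_quality_share
    acceptable = total_items - low_quality
    true_flags = low_quality * sensitivity          # low-quality copy caught
    false_flags = acceptable * (1 - specificity)    # acceptable copy flagged anyway
    return true_flags, false_flags


# Hypothetical batch: 10,000 items, 20% of them low quality, 80% sensitivity throughout.
queues = {}
for spec in (0.90, 0.70):
    true_flags, false_flags = review_queue(10_000, 0.20, 0.80, spec)
    queue_size = true_flags + false_flags
    queues[spec] = (true_flags, false_flags)
    print(
        f"specificity={spec:.0%}: {true_flags:,.0f} genuine issues + "
        f"{false_flags:,.0f} false alarms = {queue_size:,.0f} items "
        f"({true_flags / queue_size:.0%} of the queue is genuine)"
    )
```

The number of genuine issues caught is identical in both runs because it depends only on sensitivity; only the false alarm count, and therefore the reviewer workload, moves with specificity.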
Audience exclusion models that suppress ads for opted-out or undesirable audience segments require high specificity to avoid incorrectly suppressing valuable audience members. A suppression model that identifies and excludes recently churned customers, competitors’ employees, or individuals who have explicitly opted out of marketing must have high specificity (few false positives) because each false positive incorrectly suppresses a potentially valuable impression against an audience member who should have been reached. Measuring specificity for suppression models is often neglected because suppression lists are evaluated primarily on whether they correctly identify the intended exclusions (sensitivity), not on whether they incorrectly exclude members who should have been reached (specificity). Both dimensions of performance affect campaign outcomes and should be evaluated.
An agency is evaluating two brand safety classifiers, Vendor A and Vendor B, for use in a consumer packaged goods client’s programmatic display campaign. The campaign runs against approximately 4.2 billion monthly ad impressions from a demand-side platform’s full inventory. The client’s brand safety requirements exclude adult content, violence, and political controversy. The agency tests both classifiers against a labeled test set of 12,400 page-level content examples: 8,200 safe pages (negative class) and 4,200 unsafe pages (positive class, 34% prevalence). Vendor A results: sensitivity 94%, specificity 81%. Vendor B results: sensitivity 89%, specificity 93%. Vendor A detects more unsafe content (94% detection rate) but also incorrectly flags 19% of safe content as unsafe. Vendor B misses more unsafe content (89% detection rate) but incorrectly flags only 7% of safe content. The agency models the inventory impact, assuming the test set’s 34% unsafe prevalence holds across the full inventory, which gives approximately 1.43 billion unsafe and 2.77 billion safe impressions out of the 4.2 billion total. Vendor A correctly excludes approximately 1.34 billion unsafe impressions (94% of the unsafe portion) but incorrectly excludes approximately 527 million safe impressions (19% of the safe portion), leaving approximately 2.25 billion safe impressions available. Vendor B correctly excludes approximately 1.27 billion unsafe impressions but incorrectly excludes only about 194 million safe impressions, leaving approximately 2.58 billion safe impressions available. Vendor B’s higher specificity produces approximately 333 million more available safe impressions per month (about 15% more safe reach) at the cost of letting through approximately 71 million more unsafe impressions. The agency presents both scenarios to the client. Given the client’s priority on reach efficiency and the moderate severity of their brand safety requirements (standard CPG category exclusions, not highly regulated categories), the client selects Vendor B, accepting somewhat higher unsafe content exposure in exchange for about 15% broader reach in safe inventory.
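The inventory model in this example can be recomputed directly from the stated sensitivity, specificity, prevalence, and impression volume. The sketch below is one way to lay out that calculation; it assumes the test set's 34% unsafe prevalence applies to the full inventory, and the function and key names are illustrative rather than taken from any vendor tool.

```python
TOTAL_IMPRESSIONS = 4_200_000_000
UNSAFE_SHARE = 0.34  # assumption: test-set prevalence holds across the inventory

# (sensitivity, specificity) as measured on the labeled test set.
vendors = {"Vendor A": (0.94, 0.81), "Vendor B": (0.89, 0.93)}


def inventory_impact(sensitivity, specificity,
                     total=TOTAL_IMPRESSIONS, unsafe_share=UNSAFE_SHARE):
    """Split the inventory into the four confusion-matrix cells, in impressions."""
    unsafe = total * unsafe_share
    safe = total - unsafe
    return {
        "unsafe_excluded": unsafe * sensitivity,        # true positives
        "unsafe_served": unsafe * (1 - sensitivity),    # false negatives
        "safe_excluded": safe * (1 - specificity),      # false positives
        "safe_available": safe * specificity,           # true negatives
    }


impact = {name: inventory_impact(*rates) for name, rates in vendors.items()}
for name, cells in impact.items():
    print(name, {cell: f"{value / 1e6:,.0f}M" for cell, value in cells.items()})

extra_safe = impact["Vendor B"]["safe_available"] - impact["Vendor A"]["safe_available"]
extra_unsafe = impact["Vendor B"]["unsafe_served"] - impact["Vendor A"]["unsafe_served"]
print(f"Vendor B: +{extra_safe / 1e6:,.0f}M safe impressions, "
      f"+{extra_unsafe / 1e6:,.0f}M unsafe impressions served")
```

Laying the result out as four cells per vendor keeps the two costs separate: `safe_excluded` is the reach lost to false positives, and `unsafe_served` is the brand safety exposure from false negatives.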
The generative AI foundations module covers classification metrics including specificity, sensitivity, precision, recall, and ROC-AUC, and the application-specific reasoning required to select the right metric for evaluating brand safety, quality screening, and audience classification models.