Ground Truth.

The verified, correct label or value for a training example or evaluation case, established through human annotation, authoritative data sources, or direct measurement. Ground truth is what a supervised machine learning model is trained to predict, and its quality sets a practical ceiling on model performance: a model inherits the errors in the labels it is trained on, and its measured accuracy is only as trustworthy as the labels it is scored against.

Also known as: gold standard labels, labeled ground truth, annotation ground truth.

What it is

A working definition of ground truth.

Ground truth labels represent the authoritative answer for what a model should predict for a given input. For a classification model, the ground truth is the correct class label for each training example. For a regression model, it is the correct numerical value. For an object detection model, it is the correct bounding boxes and class labels for each object in each image. These labels are typically established through human annotation, where trained annotators review each example and assign the correct label according to a labeling guide, or through direct measurement, where the label is an observed outcome rather than a human judgment, such as whether a customer who received a promotional email made a purchase within 30 days.
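The label types described above can be pictured as simple records. The following is a minimal sketch, not a standard schema: every class and field name here (ClassificationLabel, BoundingBox, purchased_within_30_days, and so on) is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClassificationLabel:
    example_id: str
    label: str  # the correct class for this example


@dataclass(frozen=True)
class RegressionLabel:
    example_id: str
    value: float  # the correct numerical value


@dataclass(frozen=True)
class BoundingBox:
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str  # class of the object inside the box


@dataclass(frozen=True)
class DetectionLabel:
    image_id: str
    boxes: List[BoundingBox]  # one box per object in the image


@dataclass(frozen=True)
class OutcomeLabel:
    # Direct measurement: an observed outcome, not a human judgment.
    customer_id: str
    purchased_within_30_days: bool


# Hypothetical examples of each kind of ground truth record.
post = ClassificationLabel(example_id="post-1", label="safe")
spend = RegressionLabel(example_id="campaign-7", value=1250.0)
image = DetectionLabel(
    image_id="img-3",
    boxes=[BoundingBox(10.0, 20.0, 110.0, 220.0, label="logo")],
)
outcome = OutcomeLabel(customer_id="cust-42", purchased_within_30_days=True)
```

Keeping annotated labels and measured outcomes in separate record types makes it easier to track where each label came from when its reliability is questioned later.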

Ground truth quality is not binary: labels can be more or less reliable, more or less consistent, and more or less appropriate for the model’s intended use case. Inter-annotator agreement measures how consistently different annotators assign the same label to the same example; low agreement indicates that the labeling task is ambiguous or that the labeling guide needs refinement. Label noise, where some fraction of training labels are incorrect, generally degrades model performance more as the noise rate rises, and can bias models toward predicting the majority class if the noise is asymmetric across classes. Systematic labeling errors, where a consistent mistake is made for a specific type of example, produce models with predictable failure modes. Because an evaluation set labeled the same way shares the same mistake, these failures often surface only in production, on exactly the types of examples that were mislabeled during training.
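One common way to quantify inter-annotator agreement between two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The text above does not name a specific statistic, so treat this as one reasonable choice rather than the method; the function name and sample labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same class independently,
    # given each annotator's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class throughout
    return (observed - expected) / (1 - expected)


annotator_a = ["safe", "safe", "unsafe", "unsafe"]
annotator_b = ["safe", "unsafe", "unsafe", "unsafe"]
print(cohens_kappa(annotator_a, annotator_b))  # 0.5
```

In this example the annotators agree on 3 of 4 items (75%), but half of that agreement would be expected by chance given how often each uses each label, so kappa comes out at 0.5.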

For AI systems deployed in production, establishing ground truth for ongoing model evaluation requires access to outcomes that may lag prediction by days, weeks, or months. A model that predicts whether a lead will convert needs to wait for the actual conversion outcome to evaluate its predictions. A model that predicts customer churn needs to wait long enough to observe whether the customer actually churned. This evaluation latency creates a gap between model deployment and the availability of quality ground truth for monitoring, which agencies managing production AI systems need to account for in their evaluation and retraining cadences.
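The evaluation latency described above can be handled by scoring only predictions whose outcome window has fully elapsed. This is a minimal sketch under assumed names: the record fields (lead_id, predicted_on) and the 30-day window are hypothetical, and a real system would take the window from the definition of the outcome being predicted.

```python
from datetime import date, timedelta


def matured_predictions(predictions, as_of, outcome_window_days):
    """Return only predictions old enough for their outcome to be observable."""
    window = timedelta(days=outcome_window_days)
    return [p for p in predictions if p["predicted_on"] + window <= as_of]


predictions = [
    {"lead_id": "lead-001", "predicted_on": date(2024, 1, 15)},
    {"lead_id": "lead-002", "predicted_on": date(2024, 2, 20)},
]

# Assumed 30-day conversion window, evaluated on 1 March 2024.
ready = matured_predictions(predictions, as_of=date(2024, 3, 1), outcome_window_days=30)
print([p["lead_id"] for p in ready])  # ['lead-001']
```

Scoring a prediction before its window closes would count a lead that has not converted yet as a non-conversion, which understates the model's true positives.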

Why ad agencies care

Why ground truth might matter more in agency work than in most industries.

Every AI model an agency deploys was trained on ground truth data, and every evaluation of whether that model is performing well requires ground truth to compare against. A working ad agency that understands how ground truth is established, where it can be wrong, and how its quality affects model performance is better equipped to evaluate vendor models, design annotation programs, and interpret model performance metrics that are only as meaningful as the ground truth they are measured against.

Attribution data is a ground truth problem. Whether a conversion is attributed to a specific touchpoint depends on the attribution model, and different attribution models produce different ground truth labels for the same conversion event. A model trained on last-touch attribution labels learns a fundamentally different prediction target than one trained on data-driven attribution labels. When evaluating or retraining models that use conversion data as their ground truth, the attribution model used to create those labels must be understood and held consistent, because changing the attribution model changes the ground truth and therefore changes what the trained model is optimized to predict.
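To make the point concrete, the sketch below assigns conversion credit for one customer path under two rules. Last-touch is as described above; linear attribution is swapped in here as a simple stand-in for a multi-touch model, since a data-driven attribution model cannot be reproduced in a few lines. The channel names and function names are hypothetical.

```python
def last_touch_credit(path):
    """Give all conversion credit to the final touchpoint."""
    if not path:
        raise ValueError("A conversion path needs at least one touchpoint.")
    credit = {channel: 0.0 for channel in path}
    credit[path[-1]] = 1.0
    return credit


def linear_credit(path):
    """Split conversion credit equally across every touchpoint in the path."""
    if not path:
        raise ValueError("A conversion path needs at least one touchpoint.")
    credit = {channel: 0.0 for channel in path}
    share = 1.0 / len(path)
    for channel in path:
        credit[channel] += share  # repeated channels accumulate credit
    return credit


# One conversion event, one path, two different sets of labels.
path = ["display", "email", "paid_search"]
print(last_touch_credit(path))
print(linear_credit(path))
```

A model trained to predict the first set of labels learns that display and email never drive conversions; a model trained on the second learns that each contributes a third. Neither model is wrong about its labels, which is why the attribution rule has to stay fixed between training, evaluation, and retraining.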

Human annotation quality is a solvable but often neglected problem. Agencies that use human annotation to create training labels for custom models, whether for content classification, image labeling, or intent categorization, often underinvest in annotation quality processes. Without clear labeling guides, annotator calibration, inter-annotator agreement measurement, and gold-standard validation sets, annotation projects produce labels that are internally inconsistent. The resulting model learns from noise as much as signal. As a rough rule of thumb rather than a guaranteed return, a 5% investment in annotation quality processes typically produces a 10-20% improvement in model performance on the resulting task; the actual gain depends on the task and on how inconsistent the labels were to begin with.
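A gold-standard validation set can be used to check each annotator against items whose correct label is already known. The sketch below assumes a simple dictionary layout (annotator name to item labels) and hypothetical item IDs; it is one straightforward way to run the check, not a prescribed process.

```python
def accuracy_against_gold(annotations, gold):
    """Score each annotator on the gold-standard items they labeled.

    annotations: {annotator_name: {item_id: label}}
    gold:        {item_id: correct_label}
    """
    scores = {}
    for annotator, labels in annotations.items():
        scored_items = [item for item in labels if item in gold]
        if not scored_items:
            continue  # this annotator has not seen any gold items yet
        correct = sum(labels[item] == gold[item] for item in scored_items)
        scores[annotator] = correct / len(scored_items)
    return scores


gold = {"p1": "safe", "p2": "unsafe", "p3": "borderline", "p4": "safe"}
annotations = {
    "annotator_a": {"p1": "safe", "p2": "unsafe", "p3": "unsafe", "p4": "safe"},
    "annotator_b": {"p1": "safe", "p2": "unsafe", "p5": "safe"},  # p5 is not a gold item
}
print(accuracy_against_gold(annotations, gold))
# {'annotator_a': 0.75, 'annotator_b': 1.0}
```

Mixing gold items into the normal annotation queue, rather than testing annotators separately, gives a running measure of label quality throughout the project.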

Production outcome data is often the highest-quality ground truth available. For models that predict real business outcomes, actual observed outcomes are better ground truth than human annotations because they reflect what actually happened rather than what an annotator judged would happen. Agencies that close the feedback loop between model predictions and observed outcomes, connecting their models to the downstream data sources that record actual conversions, churn events, engagement actions, and purchase completions, accumulate ground truth of this kind continuously and can improve their models over time in ways that annotation-only programs cannot match.
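Closing the feedback loop amounts to joining stored predictions to the outcomes recorded downstream and scoring the matches. The following sketch assumes both sides are keyed by a shared ID (the lead IDs here are hypothetical) and skips any prediction whose outcome has not been recorded yet.

```python
def precision_recall(predictions, outcomes):
    """Compare boolean predictions to observed outcomes on their shared keys.

    predictions: {record_id: predicted_positive}
    outcomes:    {record_id: actually_positive}
    Returns (precision, recall); either is None when it is undefined.
    """
    true_pos = false_pos = false_neg = 0
    for record_id, predicted in predictions.items():
        if record_id not in outcomes:
            continue  # no observed outcome yet, so nothing to score
        actual = outcomes[record_id]
        if predicted and actual:
            true_pos += 1
        elif predicted and not actual:
            false_pos += 1
        elif not predicted and actual:
            false_neg += 1
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else None
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else None
    return precision, recall


predicted_to_convert = {"lead1": True, "lead2": True, "lead3": False, "lead4": True, "lead5": False}
observed_conversion = {"lead1": True, "lead2": False, "lead3": True, "lead5": False}
print(precision_recall(predicted_to_convert, observed_conversion))  # (0.5, 0.5)
```

The same joined records that score the current model can be appended to the training set for the next one, which is the mechanism by which outcome data accumulates over time.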

In practice

What ground truth looks like inside a working ad agency.

An agency is building a content safety classifier to flag brand-unsafe user-generated content on a client’s community platform. The annotation team labels 10,000 posts as safe, borderline, or unsafe using a labeling guide developed by the agency. Quality review finds that inter-annotator agreement on the borderline category is only 52%, meaning that annotators assign different labels to the same borderline posts nearly half the time. The agency revises the labeling guide with more specific criteria and worked examples for the borderline category, runs annotator calibration sessions where all annotators label the same 100 posts and review disagreements as a group, and adds a fourth label category that separates content inappropriate for brand adjacency from content that violates community standards. After the revision, inter-annotator agreement on the revised categories rises to 81%. The classifier trained on the revised labels achieves 18% higher precision on the unsafe category compared to the classifier trained on the original inconsistent labels, with no change in the model architecture or training procedure.
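The example does not say how agreement on a single category was calculated. One plausible definition, sketched here as an assumption, is: among posts that at least one of two annotators labeled with the category, the share where both did. The function name and sample labels are hypothetical and are not the agency's data.

```python
def category_agreement(labels_a, labels_b, category):
    """Share of items tagged `category` by either annotator that both tagged with it."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same items in the same order.")
    either = both = 0
    for a, b in zip(labels_a, labels_b):
        if a == category or b == category:
            either += 1
            if a == b:  # equal labels here means both chose `category`
                both += 1
    return both / either if either else None


annotator_a = ["safe", "borderline", "borderline", "unsafe", "borderline"]
annotator_b = ["safe", "borderline", "unsafe", "unsafe", "safe"]

# Borderline was used on 3 posts in total, and both annotators chose it on 1.
print(category_agreement(annotator_a, annotator_b, "borderline"))
```

Breaking agreement out by category, as the quality review in the example does, shows which part of the labeling guide needs work; an overall agreement figure would hide the fact that the disagreement is concentrated in one category.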
