Labeled Data.

A dataset in which each example has been tagged with the correct output, making it usable for supervised learning. Labeled data is the primary input for training supervised models, and for most real-world AI applications its quality and quantity are the main constraints on model quality.

Also known as: annotated data, supervised data. Often used loosely as a synonym for training data, although training data can also be unlabeled.

What it is

A working definition of labeled data.

Labeled data is a collection of examples where each example is paired with a label indicating the correct output for that input. A set of images labeled “contains a face” or “does not contain a face” is labeled data. A set of customer interactions labeled “converted” or “did not convert” is labeled data. A set of ad headlines labeled with click-through rate bins is labeled data. The labels tell a supervised learning algorithm what it should be trying to predict, and the algorithm uses this signal to adjust its parameters until its predictions match the labels as closely as possible.
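As a minimal sketch of that idea, the Python below pairs each input with a label and "trains" a one-parameter model by searching for the setting that disagrees with the fewest labels. The data (headline word counts paired with a beat-the-target flag) and the threshold rule are hypothetical, chosen only to keep the example small; real models have many more parameters and adjust them with more sophisticated methods.

```python
# Each labeled example pairs an input with the correct output.
# Hypothetical data: (headline word count, label), where the label is 1 if
# the ad beat its click-through target and 0 if it did not.
labeled_examples = [
    (4, 1), (5, 1), (6, 1), (7, 1),
    (9, 0), (11, 0), (12, 0), (14, 0),
]


def predict(word_count, threshold):
    """Predict 1 (beats target) for headlines at or below the threshold."""
    return 1 if word_count <= threshold else 0


def count_errors(threshold, examples):
    """Count the examples where the prediction disagrees with the label."""
    return sum(predict(x, threshold) != y for x, y in examples)


# "Training": try every candidate value of the single parameter and keep
# the one whose predictions match the labels most closely.
candidates = range(0, 16)
best_threshold = min(candidates, key=lambda t: count_errors(t, labeled_examples))

print("threshold:", best_threshold)
print("errors on labeled data:", count_errors(best_threshold, labeled_examples))
```

Without the labels there would be nothing to count errors against, which is why the labels are what makes the dataset usable for supervised learning.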

Creating labeled data requires defining what the labels mean, collecting the raw examples, and either manually annotating them or deriving labels from existing records. Manual annotation is performed by human annotators working from a labeling guide that defines each label category precisely. Derived labels come from existing structured records—for example, CRM records can provide conversion labels for historical customer interactions without additional annotation, because the conversion outcome was already recorded.
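Deriving labels from existing records can be sketched as a lookup that joins each interaction to its recorded outcome. The field names (`interaction_id`, `customer_id`, `converted`) and the sample rows below are assumptions made for illustration, not the schema of any particular CRM.

```python
# Hypothetical interaction log: one row per customer interaction.
interactions = [
    {"interaction_id": "i1", "customer_id": "c1", "channel": "email"},
    {"interaction_id": "i2", "customer_id": "c2", "channel": "paid_social"},
    {"interaction_id": "i3", "customer_id": "c3", "channel": "search"},
    {"interaction_id": "i4", "customer_id": "c4", "channel": "email"},
]

# Hypothetical CRM export keyed by customer, with the recorded outcome.
crm_records = {
    "c1": {"converted": True},
    "c2": {"converted": False},
    "c3": {"converted": True},
    # c4 has no CRM record, so its outcome is unknown.
}


def derive_labels(interactions, crm_records):
    """Attach a conversion label to every interaction with a known outcome."""
    labeled, unlabeled = [], []
    for row in interactions:
        record = crm_records.get(row["customer_id"])
        if record is None:
            # Never guess a label: set the example aside instead.
            unlabeled.append(row)
        else:
            label = "converted" if record["converted"] else "did not convert"
            labeled.append({**row, "label": label})
    return labeled, unlabeled


labeled, unlabeled = derive_labels(interactions, crm_records)
print(len(labeled), "labeled;", len(unlabeled), "without a recorded outcome")
```

Keeping the unmatched rows separate, rather than defaulting them to "did not convert", avoids quietly adding wrong labels to the training set.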

The scale of labeled data required varies enormously by task and by the type of model being trained. Fine-tuning a pre-trained model on a specific task may require only hundreds or thousands of labeled examples. Training a large model from scratch can require millions to billions of examples. Most agency use cases involve fine-tuning or applying pre-trained models, which makes smaller labeled datasets practical, but the quality requirements remain high regardless of scale.

Why ad agencies care

Why labeled data is the competitive moat that most agencies haven’t recognized they own.

Agencies accumulate labeled data as a byproduct of doing work. Every campaign that ran and produced performance outcomes is a set of labeled training examples—creative assets labeled with click rates, audience segments labeled with conversion rates, media placements labeled with ROAS. Every lead that was qualified or disqualified is a labeled training example. Most agencies treat this historical data as a reporting archive. It is actually a supervised learning dataset that can train predictive models specific to their clients and categories.

Proprietary labeled data can produce models that generic tools cannot replicate. A creative performance model trained on an agency’s own historical performance data, labeled with its clients’ conversion metrics, is likely to outperform a generic model on those specific clients and categories. This is the mechanism through which data network effects work: more labeled data produces better models, which produce better results, which produce more labeled data. Agencies that begin systematically collecting and labeling their historical data now are building a competitive advantage that compounds over time.

Labeled data quality requires investment in labeling infrastructure. The value of historical data as labeled training data depends on whether the labels were applied consistently. Inconsistent labeling guidelines, annotation conducted by different teams under different standards, and outcome definitions that changed over time all reduce the usable portion of a historical dataset. Agencies that invest in consistent data collection and labeling standards—even before they have a specific model to train—are building the infrastructure that makes future AI advantages possible.
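One simple way to see whether two teams applied labels consistently is to have both label the same sample and measure how often they agree. The sketch below uses hypothetical creative-type labels and a plain agreement rate; it does not correct for agreement that would occur by chance, which measures such as Cohen's kappa are designed to do.

```python
def agreement_rate(labels_a, labels_b):
    """Share of items on which two annotation sources gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both sources must label the same items")
    if not labels_a:
        raise ValueError("need at least one labeled item")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical labels assigned to the same five creative assets by two teams.
team_a = ["lifestyle", "product", "product", "promo", "lifestyle"]
team_b = ["lifestyle", "product", "promo", "promo", "product"]

rate = agreement_rate(team_a, team_b)

# Positions where the teams disagree are candidates for guideline review.
disagreements = [i for i, (a, b) in enumerate(zip(team_a, team_b)) if a != b]

print(f"agreement: {rate:.0%}, items to review: {disagreements}")
```

A low agreement rate suggests the labeling guide needs tightening before the historical data is treated as training labels.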

In practice

What labeled data looks like inside a working ad agency.

A mid-size creative agency has produced more than 3,000 digital ad campaigns over six years. The campaigns are stored in a shared drive organized by client and date, with performance metrics in separate spreadsheets that are inconsistently formatted across years and account managers. The agency decides to build a creative performance prediction system and begins by assessing the labeled data it already holds. A data audit reveals that 1,800 campaigns have usable creative assets matched to performance metrics, with outcome definitions consistent enough to serve as training labels. The team standardizes the outcome definition (click-through rate relative to category benchmark), applies the label to each campaign in the usable set, and creates a structured training dataset. They use this dataset to fine-tune a vision-language model pre-trained on general image-text pairs. The resulting model predicts whether a campaign will perform above benchmark with 72% accuracy on held-out test campaigns, meaningfully above the 50% baseline of random guessing, using only the labeled data the agency already held.
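The labeling and evaluation steps in this scenario can be sketched as follows. The campaign figures, the 2% benchmark, and the model predictions are invented for illustration and are not the agency's data; the sketch also compares against a majority-class baseline, which equals 50% only when above-benchmark and below-benchmark campaigns are evenly split.

```python
def label_campaign(clicks, impressions, category_benchmark_ctr):
    """Return 1 if the campaign's CTR beats its category benchmark, else 0."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return 1 if clicks / impressions > category_benchmark_ctr else 0


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


# Hypothetical held-out campaigns:
# (clicks, impressions, category benchmark CTR, model prediction)
held_out = [
    (30, 1000, 0.02, 1),
    (10, 1000, 0.02, 0),
    (25, 1000, 0.02, 0),
    (15, 1000, 0.02, 0),
    (50, 2000, 0.02, 1),
    (20, 2000, 0.02, 1),
    (45, 1500, 0.02, 1),
    (12, 1500, 0.02, 0),
]

labels = [label_campaign(c, i, b) for c, i, b, _ in held_out]
predictions = [p for _, _, _, p in held_out]

print("model accuracy:", accuracy(predictions, labels))
print("majority baseline:", majority_baseline(labels))
```

Evaluating on campaigns the model never saw during fine-tuning is what makes the accuracy figure a fair estimate of how it would perform on new work.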
