Datasets too large or complex for conventional software to process, requiring distributed storage, specialized infrastructure, and purpose-built analysis tools. For agencies, big data is the foundation on which audience targeting, campaign measurement, and AI-powered personalization are built.
Also known as large-scale data, high-volume data, data at scale
Big data is conventionally characterized by volume (too much for a single machine to store or process), velocity (arriving faster than conventional tools can handle), and variety (structured records, unstructured text, images, event streams, and click logs arriving simultaneously). These three characteristics mark the point where conventional database and spreadsheet tools stop working and specialized infrastructure becomes necessary.
In practice, big data infrastructure refers to the platforms and pipelines that collect, store, and process this data: data lakes, distributed query engines, real-time streaming platforms, and cloud-scale storage systems. The output of these pipelines feeds the machine learning models and analytics tools that agencies use for targeting and measurement.
AI models require large training datasets, and the quality of an AI tool is often limited by the scale and quality of the data it was trained on. Foundation models, for example, are trained on datasets measured in trillions of tokens, gathered from a large portion of the publicly accessible internet. That scale is only possible because of big data infrastructure.
Every AI tool an agency uses was built on top of big data infrastructure. Audience segments, attribution models, creative scoring systems, and recommendation engines are all downstream of data pipelines that ingest, clean, and serve data at scale. Understanding the data foundations of these tools is necessary context for evaluating their reliability.
Data quality is the constraint, not data volume. Having more data does not automatically produce better AI. A large dataset with systematic gaps, outdated records, or poor labeling produces worse results than a smaller, cleaner dataset. Agencies should ask vendors about data governance practices, not just data volume claims.
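As a rough illustration of what a basic data quality check might look like, the Python sketch below reports the share of records with a missing field, the share older than a freshness cutoff, and the share without a label. The field names (`email`, `updated`, `label`), the 365-day cutoff, and the sample records are all hypothetical and chosen only for this example; a real governance audit would cover far more.

```python
from datetime import date


def quality_report(records, today, max_age_days=365):
    """Return the share of records showing three common quality problems.

    Field names and the age cutoff are illustrative assumptions,
    not a standard.
    """
    total = len(records)
    if total == 0:
        return {"missing_email": 0.0, "stale": 0.0, "unlabeled": 0.0}
    # Systematic gaps: records with no usable email value.
    missing = sum(1 for r in records if not r.get("email"))
    # Outdated records: last update older than the cutoff.
    stale = sum(1 for r in records if (today - r["updated"]).days > max_age_days)
    # Poor labeling: records with no label at all.
    unlabeled = sum(1 for r in records if r.get("label") is None)
    return {
        "missing_email": missing / total,
        "stale": stale / total,
        "unlabeled": unlabeled / total,
    }


# Hypothetical sample records, made up for this example.
records = [
    {"email": "a@example.com", "updated": date(2024, 5, 1), "label": "buyer"},
    {"email": "", "updated": date(2021, 1, 15), "label": None},
    {"email": "c@example.com", "updated": date(2022, 3, 9), "label": "browser"},
    {"email": None, "updated": date(2024, 4, 20), "label": "buyer"},
]

report = quality_report(records, today=date(2024, 6, 1))
print(report)
```

Shares like these say nothing about volume, which is the point: a vendor quoting only record counts has not answered the question the report asks.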
First-party data strategy is becoming more important. As third-party data access declines due to privacy regulation and browser changes, agencies and their clients need first-party data strategies: structured programs to collect, store, and activate behavioral and transactional data from direct customer interactions. That requires at least a basic investment in data infrastructure.
Data access asymmetries affect what AI tools can do. Platform-owned data (search behavior, social engagement, purchase history) is not accessible to agencies in raw form. The AI tools built on that data operate inside walled gardens. Agencies should understand what data their tools actually have access to, because the answer determines what the tools can and cannot know about the target audience.
A retail client asks their agency to explain why the AI-powered audience targeting in their e-commerce campaigns outperforms the same targeting in their brick-and-mortar campaigns. The answer is data: the e-commerce environment generates clickstream, cart, and purchase event data at scale, which feeds the targeting model with dense behavioral signals. The physical retail environment generates point-of-sale transaction data and in-store app activity, which is sparser and arrives with more latency. The same AI tool performs differently because the underlying data infrastructure supporting each channel is not equivalent.
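The difference in signal density can be made concrete with a small Python sketch. The event logs below are invented for illustration and are not measurements from any client; `events_per_customer` is a hypothetical helper name.

```python
from collections import Counter


def events_per_customer(events):
    """Average number of logged events per distinct customer."""
    counts = Counter(customer for customer, _event in events)
    return sum(counts.values()) / len(counts) if counts else 0.0


# Invented e-commerce log: every click, cart action, and purchase is captured.
ecommerce = [
    ("c1", "page_view"), ("c1", "add_to_cart"), ("c1", "purchase"),
    ("c2", "page_view"), ("c2", "page_view"), ("c2", "add_to_cart"),
]

# Invented in-store log: only point-of-sale and app events are captured.
in_store = [
    ("c1", "pos_purchase"),
    ("c2", "app_open"), ("c2", "pos_purchase"),
]

ecom_density = events_per_customer(ecommerce)
store_density = events_per_customer(in_store)
print(ecom_density, store_density)
```

In this toy data the same two customers leave twice as many events online as in store, so a targeting model sees a denser behavioral signal for the e-commerce channel.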
The generative AI foundations module of the workshop covers how today’s models work, what they can and can’t do, and how to choose between them.