Data Preprocessing.

The collection of cleaning, transformation, and feature engineering steps that prepare raw data for model training. For agencies, data preprocessing is where most of the real work in building AI models happens, and where most of the subtle mistakes that damage model quality get made.

Also known as data preparation or data wrangling. Feature engineering is sometimes used as a synonym, though strictly it is one part of preprocessing.

What it is

A working definition of data preprocessing.

Data preprocessing converts raw data into the form a machine learning model can use. This includes cleaning (removing errors and handling missing values), normalization (rescaling numeric features to comparable ranges), encoding (converting categorical variables like “campaign type” into numeric representations), and feature engineering (constructing new variables from existing ones that better represent the underlying relationships the model needs to learn).
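The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the records, the field names (campaign_type, spend, clicks), the median fill, and the cost-per-click feature are all assumptions made up for the example.

```python
from statistics import median

# Hypothetical raw campaign records; None marks a missing value.
raw_rows = [
    {"campaign_type": "search", "spend": 1200.0, "clicks": 300},
    {"campaign_type": "social", "spend": None, "clicks": 150},
    {"campaign_type": "display", "spend": 800.0, "clicks": 40},
    {"campaign_type": "search", "spend": 2000.0, "clicks": 500},
]


def preprocess(rows):
    # Cleaning: fill missing spend with the median of the observed values.
    observed = [r["spend"] for r in rows if r["spend"] is not None]
    fill = median(observed)
    spends = [r["spend"] if r["spend"] is not None else fill for r in rows]

    # Normalization: min-max rescale spend into the range [0, 1].
    lo, hi = min(spends), max(spends)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal

    # Encoding: one-hot encode campaign_type over the categories seen in the data.
    categories = sorted({r["campaign_type"] for r in rows})

    out = []
    for r, spend in zip(rows, spends):
        features = {"spend_scaled": (spend - lo) / span}
        for c in categories:
            features[f"type_{c}"] = 1 if r["campaign_type"] == c else 0
        # Feature engineering: cost per click, derived from two raw fields.
        features["cost_per_click"] = spend / r["clicks"] if r["clicks"] else 0.0
        out.append(features)
    return out


features = preprocess(raw_rows)
```

In a real project the fill value, the min and max, and the category list would be computed on the training data only and then reused unchanged on new data, so that information from the evaluation set does not leak into training.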

The sequence of preprocessing steps is not neutral. Each choice about how to handle missing values, which encoding method to apply to categorical features, and what derived features to create reflects assumptions about the data and the prediction task. Those assumptions can be correct or incorrect, and the model will learn accordingly.
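To make the point about assumptions concrete, here is a small sketch that fills the same missing values three different ways. The conversion counts are invented for illustration, and the three strategies are common options rather than a recommendation.

```python
from statistics import mean, median

# Hypothetical daily conversion counts; None marks days with no tracking data.
conversions = [2, 3, None, 2, 40, None, 3]

observed = [c for c in conversions if c is not None]


def impute(values, fill):
    """Replace every missing entry with a single fill value."""
    return [fill if v is None else v for v in values]


# Assumes "missing" means "nothing happened that day".
as_zero = impute(conversions, 0)

# Assumes missing days look like the average day, outlier included.
as_mean = impute(conversions, mean(observed))

# Assumes missing days look like a typical day, ignoring the outlier.
as_median = impute(conversions, median(observed))

# Keeps the fact that a value was missing as its own feature.
was_missing = [int(v is None) for v in conversions]
```

With these numbers the mean fill is 10 while the median fill is 3, because a single 40-conversion day pulls the mean upward. The model sees very different "typical" days depending on which assumption was chosen.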

Feature engineering is often the highest-leverage part of preprocessing. A model trained on raw timestamps will struggle to learn weekly patterns. A model trained on a “day of week” feature derived from those timestamps will learn weekly patterns easily. Knowing which features to construct and how requires both domain knowledge and familiarity with how models represent and use information, making it a genuinely strategic activity.
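The timestamp example can be sketched as follows. The function name and the extra is_weekend flag are illustrative assumptions, and the input is assumed to be an ISO 8601 string.

```python
from datetime import datetime


def day_of_week_features(timestamp: str) -> dict:
    """Derive weekly-pattern features from an ISO 8601 timestamp string."""
    dt = datetime.fromisoformat(timestamp)
    weekday = dt.weekday()  # Monday is 0, Sunday is 6
    return {
        "day_of_week": weekday,
        "is_weekend": int(weekday >= 5),
    }


saturday = day_of_week_features("2024-03-09T14:30:00")
monday = day_of_week_features("2024-03-11T09:00:00")
```

Two timestamps that are far apart as raw numbers now share an explicit value whenever they fall on the same weekday, which is the pattern the model needs to see.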

Why ad agencies care

Why data preprocessing might matter more in agency work than in most industries.

Marketing data is messy in predictable ways: sparse behavioral signals, inconsistent category labels across systems, temporal patterns that raw timestamps obscure, and skewed distributions in conversion events. Agencies that understand preprocessing know how to handle these patterns correctly rather than hoping the model figures them out.
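Two of these patterns, inconsistent labels and skewed counts, have simple standard treatments. The sketch below is a rough illustration: the alias table and channel names are hypothetical, and a log transform is only one of several ways to tame a skewed distribution.

```python
import math

# Hypothetical alias table mapping labels from different systems to one scheme.
CHANNEL_ALIASES = {
    "fb": "paid_social",
    "facebook": "paid_social",
    "paid social": "paid_social",
    "google ads": "paid_search",
    "sem": "paid_search",
    "paid search": "paid_search",
}


def normalize_channel(label: str) -> str:
    """Map a free-form channel label to a canonical category."""
    key = label.strip().lower().replace("_", " ")
    # Unknown labels go to an explicit bucket instead of being silently dropped.
    return CHANNEL_ALIASES.get(key, "other")


def compress_skew(count: float) -> float:
    """Apply log(1 + x) so a few very large counts do not dominate the feature."""
    return math.log1p(count)
```

For example, normalize_channel(" Facebook ") and normalize_channel("FB") both return "paid_social", and compress_skew maps a count of 999 to roughly 6.9 rather than leaving it hundreds of times larger than a typical value.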

Preprocessing choices determine what the model can learn. A model cannot learn a relationship between features that were not included, cannot handle a categorical variable that was not encoded, and cannot generalize across a distribution it was not exposed to during training. The features that go in constrain the patterns that can come out.

Vendor preprocessing is often a black box. Many AI tools perform preprocessing internally before training. When those tools underperform, the preprocessing choices are often the root cause, yet they are rarely documented. Asking vendors exactly how they handle missing values, categorical encoding, and feature selection is a legitimate evaluation question, and one that most agencies do not ask.

Preprocessing is reproducible only if it is documented. A model deployed without documentation of its preprocessing pipeline cannot be reliably retrained when the data changes. Agencies building custom models for clients need to treat preprocessing documentation as a deliverable, not an afterthought.
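One lightweight way to make the pipeline a deliverable is to save its fitted parameters as a readable file next to the model. The sketch below assumes a single numeric spend field and a JSON format; the function names and the spec layout are hypothetical, not a standard.

```python
import json


def fit_preprocessing(spends):
    """Learn the parameters the pipeline needs and return them as plain data."""
    observed = sorted(s for s in spends if s is not None)
    return {
        "version": 1,
        "spend": {
            # Middle element of the sorted observed values, used as the fill.
            "missing_fill": observed[len(observed) // 2],
            "min": observed[0],
            "max": observed[-1],
        },
    }


def apply_preprocessing(spend, spec):
    """Apply the documented steps: fill a missing value, then min-max scale."""
    params = spec["spend"]
    value = params["missing_fill"] if spend is None else spend
    span = (params["max"] - params["min"]) or 1.0
    return (value - params["min"]) / span


spec = fit_preprocessing([1200.0, None, 800.0, 2000.0])

# The JSON text is the documentation: it can be versioned and handed to the client.
spec_json = json.dumps(spec, indent=2, sort_keys=True)

# Later, possibly on another machine, the same steps can be replayed exactly.
restored = json.loads(spec_json)
scaled_missing = apply_preprocessing(None, restored)
```

Because the fill value and scaling bounds live in the saved spec rather than in someone's memory, retraining on new data starts from a known, inspectable baseline.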

In practice

What data preprocessing looks like inside a working ad agency.

An agency builds a lead scoring model for a B2B client. The raw CRM data includes a “company size” field that is filled inconsistently: some records use employee count, others use revenue, and some have text like “enterprise” or “SMB.” Without preprocessing, the model cannot use this field at all. The agency standardizes the field into a four-tier categorical variable, encodes it numerically, and constructs an interaction feature combining company size with industry vertical. The preprocessing work takes longer than the model training. That ratio is normal for well-built models on real-world client data.
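A rough sketch of that standardization step follows. Everything specific in it is an assumption for illustration: the four tier names, the employee and revenue cut-offs, and the text formats the field is assumed to arrive in ("250", "1,200 employees", "$12M", "enterprise"). A real project would set these with the client.

```python
import re

# Hypothetical tiers, ordered from smallest to largest.
TIERS = ["small", "mid_market", "large", "enterprise"]

TEXT_TO_TIER = {
    "smb": "small",
    "small": "small",
    "mid-market": "mid_market",
    "large": "large",
    "enterprise": "enterprise",
}

# Hypothetical upper bounds; anything at or above the last bound is "enterprise".
EMPLOYEE_BOUNDS = [(50, "small"), (500, "mid_market"), (5000, "large")]
REVENUE_BOUNDS_USD = [(10e6, "small"), (100e6, "mid_market"), (1e9, "large")]


def _bucket(value, bounds):
    for upper, tier in bounds:
        if value < upper:
            return tier
    return "enterprise"


def company_size_tier(raw: str):
    """Map a free-form company size entry to one of the four tiers, or None."""
    text = raw.strip().lower()
    if text in TEXT_TO_TIER:
        return TEXT_TO_TIER[text]
    # Revenue, assumed to be written like "$12M" or "$1.5B".
    m = re.fullmatch(r"\$\s*(\d+(?:\.\d+)?)\s*([kmb])?", text)
    if m:
        scale = {"k": 1e3, "m": 1e6, "b": 1e9, None: 1.0}[m.group(2)]
        return _bucket(float(m.group(1)) * scale, REVENUE_BOUNDS_USD)
    # Employee count, assumed to be written like "250" or "1,200 employees".
    m = re.fullmatch(r"(\d[\d,]*)(\s*employees)?", text)
    if m:
        return _bucket(int(m.group(1).replace(",", "")), EMPLOYEE_BOUNDS)
    return None  # unparseable: flag for manual review instead of guessing


def lead_features(raw_size: str, vertical: str) -> dict:
    tier = company_size_tier(raw_size)
    return {
        "size_tier": tier,
        # Numeric encoding that preserves tier order; -1 marks unknown.
        "size_code": TIERS.index(tier) if tier else -1,
        # Interaction feature: company size combined with industry vertical.
        "size_x_vertical": f"{tier}|{vertical}" if tier else None,
    }
```

Calling lead_features("1,200 employees", "fintech") yields the tier "large", the code 2, and the interaction value "large|fintech". Returning None for entries that cannot be parsed keeps bad records visible rather than forcing them into a tier.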
