The set of transformations applied to raw data before it is used to train a machine learning model or generate model predictions. Pre-processing converts raw inputs into the standardized, clean, consistently formatted form that models require, and its quality directly determines the quality of model outputs. "Garbage in, garbage out" is shorthand for the reality that even the most sophisticated model produces unreliable results from poorly pre-processed inputs.
Also known as data preprocessing, data preparation, data cleaning
Pre-processing is the transformation pipeline that converts raw data into model-ready inputs. For structured data such as campaign performance metrics, pre-processing includes handling missing values by imputation or removal, scaling numerical features to a consistent range through normalization or standardization, encoding categorical variables as numerical representations through one-hot encoding or embeddings, and removing or flagging outliers that may corrupt model training. For unstructured data such as text, pre-processing includes tokenization (splitting text into the units the model operates on), normalization of case and punctuation, and truncation or padding to a fixed length.
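Two of the structured-data steps above, categorical encoding and outlier flagging, can be illustrated with a minimal standard-library sketch. The campaign rows, the channel names, and the choice of a median-absolute-deviation rule with threshold `k=3.0` are all assumptions made for illustration; the entry itself does not prescribe a specific outlier rule.

```python
from statistics import median

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category list."""
    return [1 if value == c else 0 for c in categories]

def flag_outliers(values, k=3.0):
    """Flag values more than k median absolute deviations (MAD) from the median.

    The MAD rule and the default k are illustrative assumptions.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [abs(v - med) / mad > k for v in values]

# Hypothetical campaign rows: (channel, daily spend)
rows = [
    ("search", 120.0),
    ("social", 95.0),
    ("display", 110.0),
    ("search", 105.0),
    ("social", 4000.0),
]

# Fix the category order once so every row is encoded the same way.
channels = sorted({channel for channel, _ in rows})
spend_flags = flag_outliers([spend for _, spend in rows])

# Each feature row: one-hot channel, raw spend, outlier flag.
features = [
    one_hot(channel, channels) + [spend, int(flag)]
    for (channel, spend), flag in zip(rows, spend_flags)
]
```

Flagging rather than dropping the extreme row keeps the record available while letting the model, or a later filtering step, treat it differently.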
The most important operational principle of pre-processing is training-inference consistency: every transformation applied to training data must be applied identically to new data at inference time, using parameters fitted on the training data rather than the new data. A feature scaling step that normalizes inputs by subtracting the mean and dividing by the standard deviation of the training data must use the training set mean and standard deviation at inference time. If the mean and standard deviation are recomputed from the new data at inference time, the model receives inputs on a different scale from the one it was trained on and produces incorrect predictions. Formalizing the full pre-processing sequence as a pipeline that preserves fitted parameters is the standard practice for preventing this class of training-inference inconsistency.
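The consistency principle can be sketched with a small scaler that stores its fitted parameters. The class name `FittedScaler` and the spend values are hypothetical; this is a standard-library illustration of the idea, not a substitute for a production pipeline library.

```python
from statistics import mean, pstdev

class FittedScaler:
    """Standardizes values using a mean and standard deviation fitted once."""

    def fit(self, values):
        self.mean_ = mean(values)
        # Fall back to 1.0 so a constant feature does not divide by zero.
        self.std_ = pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

# Hypothetical daily spend values.
train_spend = [100.0, 200.0, 300.0]
new_spend = [1000.0, 1100.0, 1200.0]

# Correct: fit on training data, reuse the fitted parameters at inference.
scaler = FittedScaler().fit(train_spend)
consistent = scaler.transform(new_spend)

# Anti-pattern: refitting on the new data hides the fact that it is far
# outside the training range.
refit = FittedScaler().fit(new_spend).transform(new_spend)
```

With the training parameters, the new values come out many standard deviations above the training mean, which is the signal the model should see. After refitting, the same values are centered on zero and look indistinguishable from typical training inputs.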
Missing value handling deserves particular attention in marketing datasets, where missingness is often not random. A click column with missing values in an impression log may indicate impressions that were served but not tracked, impressions from a platform that does not report click data, or impressions that occurred during a tracking outage. Imputing missing clicks with zero, the mean click rate, or a platform-specific average all make different assumptions about why the data is missing and produce different model behaviors. Understanding the mechanism of missingness, and choosing imputation strategies that match the likely cause, prevents missing data from introducing systematic biases into trained models.
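The three imputation choices mentioned above can be compared side by side in a short sketch. The log rows, the platform labels, and the function name `impute` are hypothetical, and the indicator column `clicks_was_missing` is an added convenience rather than something this paragraph requires.

```python
from statistics import mean

# Hypothetical impression-log rows; None marks a missing click count.
log = [
    {"platform": "A", "clicks": 4},
    {"platform": "A", "clicks": 6},
    {"platform": "A", "clicks": None},
    {"platform": "B", "clicks": 1},
    {"platform": "B", "clicks": None},
]

def impute(rows, strategy):
    """Fill missing clicks using 'zero', 'global_mean', or 'platform_mean'."""
    observed = [r["clicks"] for r in rows if r["clicks"] is not None]
    by_platform = {}
    for r in rows:
        if r["clicks"] is not None:
            by_platform.setdefault(r["platform"], []).append(r["clicks"])

    out = []
    for r in rows:
        missing = r["clicks"] is None
        if not missing:
            fill = r["clicks"]
        elif strategy == "zero":
            fill = 0  # assumes untracked impressions had no clicks
        elif strategy == "global_mean":
            fill = mean(observed)  # assumes missingness is unrelated to platform
        elif strategy == "platform_mean":
            # Assumes the platform has at least one observed value.
            fill = mean(by_platform[r["platform"]])
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        out.append({**r, "clicks": fill, "clicks_was_missing": missing})
    return out

zero_filled = impute(log, "zero")
global_filled = impute(log, "global_mean")
platform_filled = impute(log, "platform_mean")
```

The same missing row on platform A becomes 0, roughly 3.67, or 5 depending on the strategy, which is why the choice should follow from the likely cause of the gap rather than from convenience.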
A working ad agency building AI models for audience scoring, campaign performance prediction, or creative analysis will spend more time on pre-processing than on model selection, and rightly so. Practitioners with deep experience in applied machine learning consistently report that data quality and pre-processing have more impact on deployed model performance than the choice between a gradient boosted tree and a neural network. A sophisticated model trained on poorly pre-processed data will underperform a simple model trained on clean, consistently formatted data. The highest-leverage AI capability investment for an agency data team is building robust, reusable pre-processing pipelines rather than pursuing increasingly complex model architectures on mediocre data.
Impression and click data from multiple ad platforms require careful normalization before model training, because an attribution model trained on such data must reconcile platform-specific metric definitions, reporting windows, and counting methodologies. One platform reports view-through conversions with a 30-day window; another uses a 7-day window. One platform uses last-click attribution; another uses first-click. Without pre-processing that normalizes these definitional differences, the model learns platform-specific artifacts rather than genuine cross-channel effects. Pre-processing for multi-platform marketing data requires a documented data dictionary that captures the exact definition of each metric from each source, together with standardization transforms that produce comparable metrics across platforms.
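One way to make the data dictionary executable is to encode each platform's definitions as data and derive the standardization from them. The sketch below is a minimal illustration under strong assumptions: the platform keys, field names, and the availability of per-conversion lag in days are all hypothetical, and many platforms only expose aggregated counts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    source_field: str              # column name in the platform export
    view_through_window_days: int  # window the platform uses when counting
    attribution_model: str         # recorded for documentation purposes

# Hypothetical data dictionary; names and values are illustrative only.
DATA_DICTIONARY = {
    "platform_a": MetricDefinition("vt_conversions", 30, "last_click"),
    "platform_b": MetricDefinition("view_conv", 7, "first_click"),
}

# The shortest window is the only one every platform can support.
COMMON_WINDOW_DAYS = min(
    d.view_through_window_days for d in DATA_DICTIONARY.values()
)

def comparable_conversions(platform, conversion_lags_days):
    """Recount conversions under the window shared by all platforms.

    Assumes event-level data with the lag (in days) between the ad view
    and each conversion.
    """
    if platform not in DATA_DICTIONARY:
        raise KeyError(f"no documented definition for {platform}")
    return sum(1 for lag in conversion_lags_days if lag <= COMMON_WINDOW_DAYS)

# Platform A counted five conversions under its 30-day window;
# only two of them fall inside the common 7-day window.
a_count = comparable_conversions("platform_a", [1, 5, 9, 20, 29])
```

Recounting under a common window addresses only the window mismatch. The attribution model difference is recorded in the dictionary but not reconciled here, since that generally requires path-level data.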
Time-series pre-processing for marketing data must account for seasonality, holidays, and promotional events. A campaign performance model trained on time-series data without seasonal adjustment will confuse seasonal patterns with true campaign effects. A model trained on data that includes holiday weeks without flagging them may overestimate campaign performance during the holiday period or underestimate it in the post-holiday period, depending on the campaign category. Pre-processing time-series marketing data should include calendar feature engineering that creates explicit features for day of week, month, holiday proximity, and promotional event flags, so that the model can account for these factors as explicit input features rather than leaving them as unexplained variance in the target variable.
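The calendar features listed above can be derived from a date with the standard library alone. In this sketch the holiday list, the promotional period, and the function name `calendar_features` are hypothetical placeholders; a real project would load them from the client's own calendar.

```python
from datetime import date

# Hypothetical calendar inputs for illustration.
HOLIDAYS = [date(2023, 11, 24), date(2023, 12, 25)]
PROMO_PERIODS = [(date(2023, 11, 20), date(2023, 11, 27))]

def calendar_features(day):
    """Build explicit calendar features for a single date."""
    return {
        "day_of_week": day.weekday(),  # Monday = 0, Sunday = 6
        "month": day.month,
        "days_to_nearest_holiday": min(abs((day - h).days) for h in HOLIDAYS),
        "is_promo": int(any(start <= day <= end for start, end in PROMO_PERIODS)),
    }

in_promo = calendar_features(date(2023, 11, 22))
after_promo = calendar_features(date(2023, 12, 1))
```

Using an unsigned distance to the nearest holiday is a simplification. A signed distance, or separate before and after features, would let the model distinguish pre-holiday build-up from post-holiday decline.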
Text pre-processing for brand voice and compliance analysis affects which linguistic signals the model can detect. A brand voice classifier trained on text that has been case-normalized, punctuation-stripped, and aggressively tokenized may lose the stylistic signals it needs to detect brand voice: capitalization of brand-specific terms, punctuation patterns that signal tone, and spacing conventions that characterize a brand’s writing style are all destroyed by aggressive normalization. Pre-processing text for stylistic analysis should preserve more surface-level features than pre-processing text for semantic analysis. Understanding what features the downstream model needs determines which pre-processing transforms are appropriate and which are destructive.
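The difference between the two approaches is easy to see on a single sentence. The sketch below contrasts an aggressive normalizer with a lighter tokenizer; both function names and the sample copy (including the brand term "AcmePhone") are hypothetical, and the lighter version still discards spacing, which a real stylistic pipeline might also want to keep.

```python
import re

def preprocess_semantic(text):
    """Aggressive normalization: lowercase, strip punctuation, split on spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()

def preprocess_stylistic(text):
    """Light tokenization: keep case, and keep punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical ad copy.
sample = "Meet the all-new AcmePhone. Bold? Absolutely!"

semantic_tokens = preprocess_semantic(sample)
stylistic_tokens = preprocess_stylistic(sample)
```

The semantic version yields `['meet', 'the', 'allnew', 'acmephone', 'bold', 'absolutely']`, losing the brand capitalization and the question-and-exclamation rhythm. The stylistic version keeps `'AcmePhone'`, `'?'`, and `'!'` as tokens a classifier can use.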
An agency builds a creative performance prediction model for a retail client to predict click-through rate from creative features before campaign launch. The training dataset combines campaign performance data from 3 ad platforms and 2 years of creative production history. Pre-processing the dataset takes 3 weeks and reveals several issues.
Platform A reports impressions at the ad level; Platforms B and C report at the creative set level, requiring disaggregation to match Platform A’s granularity. Click-through rates for impressions served during two promotional events (Black Friday and a spring sale) are 3x to 6x above baseline; these weeks are flagged with a “promotional event” binary feature rather than excluded. Eighteen creative assets are missing CTR data because they were paused before accumulating 1,000 impressions; these are removed from the training set after confirming they were paused for production quality reasons rather than performance reasons, to avoid selection bias. Video creative feature extraction produces 12 features per creative; still image feature extraction produces 8 features. The union of features produces a combined feature matrix with 20% missing values for still image features in video creative rows, and vice versa. Missing values are imputed with 0, and a binary indicator column is added for each feature to record whether the value was imputed or observed.
After pre-processing, the training dataset contains 4,200 complete creative-performance records suitable for model training. A gradient boosted tree model trained on the pre-processed dataset achieves an AUC of 0.79 on a held-out test set. The pre-processing work accounts for 70% of the project timeline, consistent with the view that data preparation, not model selection, is the highest-effort and highest-leverage activity in the modeling process.
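The zero-imputation with indicator columns used in this example can be sketched as follows. The feature counts (12 video, 8 still image) come from the example; the feature names, the function `to_row`, and the placeholder value 0.5 are hypothetical.

```python
# Feature names are placeholders; only the counts come from the example.
VIDEO_FEATURES = [f"video_{i}" for i in range(12)]
IMAGE_FEATURES = [f"image_{i}" for i in range(8)]
ALL_FEATURES = VIDEO_FEATURES + IMAGE_FEATURES

def to_row(observed):
    """Expand one creative's observed features into the union schema.

    Features absent from `observed` are set to 0.0, and a paired
    `<name>_imputed` column records whether the value was filled in.
    """
    row = {}
    for name in ALL_FEATURES:
        is_missing = name not in observed
        row[name] = 0.0 if is_missing else observed[name]
        row[f"{name}_imputed"] = int(is_missing)
    return row

# A video creative has values only for the video features.
video_creative = {name: 0.5 for name in VIDEO_FEATURES}
row = to_row(video_creative)
imputed_count = sum(row[f"{name}_imputed"] for name in ALL_FEATURES)
```

The indicator columns let a tree-based model separate "this feature is genuinely zero" from "this feature does not apply to this creative type", which a bare zero fill would conflate.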
The generative AI foundations module covers data preprocessing in depth, including missing value handling, feature scaling, categorical encoding, time-series preparation, and the pipeline practices that maintain pre-processing consistency from training through production deployment.