AI Glossary · Letter D

Data Cleansing.

The process of identifying and correcting errors, inconsistencies, and gaps in a dataset before it is used for analysis or model training. For agencies, data cleansing is the unglamorous prerequisite to every AI campaign tool that promises to learn from the client’s data.

Also known as: data cleaning, data scrubbing, data quality remediation.

What it is

A working definition of data cleansing.

Data cleansing addresses the reality that most real-world data has problems. Records are duplicated. Fields are blank. Formats are inconsistent: dates in three different styles, state abbreviations mixed with full names. Values are implausible: a customer age of 300, or a purchase total of negative fifty dollars. Addresses are wrong. Names contain encoding artifacts from system migrations.

Cleansing approaches vary by problem type. Duplicates are identified through record linkage algorithms and merged or removed. Missing values are imputed from related data or dropped if imputation would introduce bias. Format inconsistencies are standardized. Outliers are investigated and either corrected or flagged as legitimate anomalies for downstream handling.
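To make those four approaches concrete, here is a minimal Python sketch that uses only the standard library. Everything in it is an assumption for illustration: the field names, the sample records, the 0 to 120 plausible age range, and the choice of median imputation. Deduplication here is a plain exact match on a normalized email address, a much simpler stand-in for a real record linkage algorithm, which would also catch near-matches.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw records; field names and values are illustrative only.
RAW = [
    {"email": "Ana@Example.com ", "state": "California", "signup": "2023-01-15", "age": 34},
    {"email": "ana@example.com", "state": "CA", "signup": "01/15/2023", "age": 34},
    {"email": "bo@example.com", "state": "ny", "signup": "15 Jan 2023", "age": None},
    {"email": "cy@example.com", "state": "TX", "signup": "2023-02-01", "age": 300},
    {"email": "di@example.com", "state": "Texas", "signup": "2023-03-09", "age": 52},
]

STATE_NAMES = {"california": "CA", "new york": "NY", "texas": "TX"}
# "%b" matches English month abbreviations under the default C locale.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def standardize_state(value):
    """Map full state names to abbreviations; upper-case anything else."""
    key = value.strip().lower()
    return STATE_NAMES.get(key, key.upper())


def standardize_date(value):
    """Try each known format and return an ISO date, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable: leave for manual review


def cleanse(records):
    seen, cleaned = set(), []
    for rec in records:
        email = rec["email"].strip().lower()
        if email in seen:  # duplicate: keep the first record, drop the rest
            continue
        seen.add(email)
        age = rec["age"]
        implausible = age is not None and not 0 <= age <= 120
        cleaned.append({
            "email": email,
            "state": standardize_state(rec["state"]),
            "signup": standardize_date(rec["signup"]),
            "age": age,
            # Outliers are flagged for investigation, not silently changed.
            "age_flag": "implausible" if implausible else None,
        })
    # Impute missing ages from the median of the plausible ones.
    plausible = [r["age"] for r in cleaned
                 if r["age"] is not None and r["age_flag"] is None]
    fill = median(plausible) if plausible else None
    for r in cleaned:
        if r["age"] is None and fill is not None:
            r["age"] = fill
            r["age_flag"] = "imputed"
    return cleaned
```

Calling `cleanse(RAW)` returns four records instead of five, with states as two-letter codes, dates in ISO format, the missing age filled and marked as imputed, and the age of 300 flagged but left untouched. Marking imputed and implausible values instead of overwriting them keeps the decision visible to whoever handles the data downstream.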

The underlying principle: a model trained on dirty data learns from errors as if they were facts. A churn prediction model trained on data where 15% of customer records contain duplicate transactions will make predictions based on a systematically distorted picture of customer behavior, and no amount of model sophistication will correct for that.
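A small sketch shows how this distortion arises. The transaction log below is hypothetical, and order count stands in for whatever behavioral feature a real churn model would use.

```python
from collections import Counter

# Hypothetical transaction log as (customer_id, order_id) pairs.
# Order 1002 was loaded twice and order 1004 three times.
transactions = [
    ("c1", 1001), ("c1", 1002), ("c1", 1002),
    ("c2", 1003),
    ("c3", 1004), ("c3", 1004), ("c3", 1004),
]


def orders_per_customer(rows):
    """Count rows per customer, as a naive feature pipeline might."""
    return dict(sorted(Counter(customer for customer, _ in rows).items()))


dirty = orders_per_customer(transactions)       # {'c1': 3, 'c2': 1, 'c3': 3}
clean = orders_per_customer(set(transactions))  # {'c1': 2, 'c2': 1, 'c3': 1}
```

In the uncleaned counts, customer `c3` looks as active as `c1` and three times as active as `c2`, although `c3` placed a single order. A model that uses order count as a signal of engagement would learn from that false picture.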

Why ad agencies care

Why data cleansing might matter more in agency work than in most industries.

Agencies frequently work with client data that has been collected across years, systems, and teams with different conventions and no central governance. The first honest look at a client’s CRM or CDP data often reveals a significant gap between what the client believes the data contains and what is actually there.

Garbage in, garbage out applies with more force to AI. Traditional analytics can partially compensate for dirty data through manual review and judgment. Machine learning models internalize data quality problems into their parameters. A lead scoring model trained on a CRM with 20% duplicate entries will score new leads using patterns derived in part from those artifacts, and because the scores will still look plausible, nothing in the output signals the problem.

Cleansing reveals the health of the client’s data practices. The types of errors found during cleansing tell the agency how the data was collected and governed. Systematic duplications suggest no deduplication at ingestion. Large volumes of missing fields suggest fields that were never required during data entry. That diagnosis is a consulting deliverable in its own right.

Cleansing is not a one-time task. New data arrives continuously with the same quality problems as before. Agencies building sustainable AI capabilities for clients need to establish ongoing data quality monitoring and enforcement, not just a pre-project cleanse that degrades as soon as new records start flowing in.
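One way to picture ongoing enforcement is a check that runs on every incoming batch and quarantines records that break the rules. This is a sketch under assumptions, not a description of any particular tool: the required fields, the age range, and the function names `validate` and `ingest` are all hypothetical, and ages are assumed to be numeric when present.

```python
REQUIRED = ("email", "state")  # hypothetical required fields


def validate(record):
    """Return a list of rule violations for one incoming record."""
    problems = []
    for field in REQUIRED:
        if not str(record.get(field) or "").strip():
            problems.append(f"missing:{field}")
    age = record.get("age")
    if age is not None and not 0 <= age <= 120:
        problems.append("implausible:age")
    return problems


def ingest(batch, known_emails):
    """Split a batch into accepted records and quarantined ones with reasons."""
    accepted, quarantined = [], []
    for rec in batch:
        problems = validate(rec)
        email = str(rec.get("email") or "").strip().lower()
        if email and email in known_emails:
            problems.append("duplicate:email")
        if problems:
            quarantined.append((rec, problems))
        else:
            known_emails.add(email)  # remember it for later batches
            accepted.append(rec)
    return accepted, quarantined
```

Because `known_emails` persists between calls, a duplicate that arrives next month is caught the same way as one in the first batch. The reasons attached to quarantined records also give the agency a running log of which quality problems keep recurring.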

In practice

What data cleansing looks like inside a working ad agency.

An agency starts an engagement to build an AI-powered personalization program for a retail client. Ingesting the client’s customer database reveals 23% duplicate records, address fields in four different formats, and purchase history with 8% null values. Rather than proceeding to model training, the agency produces a data audit report that categorizes the errors and recommends a remediation approach for each. The client’s internal team handles the address standardization. The agency handles deduplication using a record linkage algorithm. Imputation rules are agreed on for the null purchase values. Model training begins four weeks later on data the agency can actually trust.
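The headline numbers in an audit like this one come from simple counts. The sketch below is illustrative only: the sample rows, the field names, and the `audit` function are hypothetical, ZIP codes stand in for full address fields, and duplicates are detected by exact match on a normalized email instead of by record linkage.

```python
import re
from collections import Counter

# Hypothetical customer rows; not the client data described above.
customers = [
    {"id": 1, "email": "ana@example.com", "zip": "94107", "last_purchase": 120.0},
    {"id": 2, "email": "ANA@example.com", "zip": "94107-1234", "last_purchase": 120.0},
    {"id": 3, "email": "bo@example.com", "zip": "9410", "last_purchase": None},
    {"id": 4, "email": "cy@example.com", "zip": "10001", "last_purchase": 35.5},
]


def zip_format(value):
    """Classify a ZIP code string by its format."""
    if re.fullmatch(r"\d{5}", value):
        return "zip5"
    if re.fullmatch(r"\d{5}-\d{4}", value):
        return "zip+4"
    return "other"


def audit(rows, key, nullable):
    """Summarize duplicate rate, null rate, and format mix for a dataset."""
    n = len(rows)
    if n == 0:
        raise ValueError("cannot audit an empty dataset")
    keys = Counter(r[key].strip().lower() for r in rows)
    duplicates = sum(count - 1 for count in keys.values())  # extras beyond the first
    nulls = sum(1 for r in rows if r[nullable] is None)
    return {
        "rows": n,
        "duplicate_rate": duplicates / n,
        "null_rate": nulls / n,
        "zip_formats": dict(Counter(zip_format(r["zip"]) for r in rows)),
    }


report = audit(customers, key="email", nullable="last_purchase")
```

On these four rows, `report` shows a duplicate rate of 0.25, a null rate of 0.25, and three ZIP formats in use. Figures of this kind give the client and the agency a shared basis for deciding who remediates what.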

Build the data practices that make AI work in your clients’ actual environments through The Creative Cadence Workshop.

The governance and disclosure module of the workshop covers the internal standards your agency needs to use AI responsibly, including how to evaluate and communicate data quality risks before they become campaign problems.