Exploratory Data Analysis.
The process of summarizing, visualizing, and interrogating a dataset before formal modelling begins. Exploratory data analysis (EDA) is how practitioners develop an accurate understanding of what the data actually contains, including its distributions, anomalies, missing values, and relationships between variables, rather than assuming the data matches expectations.
Also known as: EDA, data exploration, data profiling.
A working definition of exploratory data analysis.
Before building a model, a practitioner needs to understand the raw material: what variables are present, what their distributions look like, how they relate to each other, where values are missing, and whether the data contains obvious errors or anomalies. EDA is the structured practice of answering these questions through statistical summaries and visualizations before any modelling decisions are made.
Common EDA steps include computing summary statistics (mean, median, variance, percentiles) for each variable; plotting distributions to identify skew, outliers, and unexpected values; creating correlation matrices to find relationships between variables; examining missingness patterns to understand whether data is missing at random or systematically; and checking class balance in classification targets to determine whether oversampling or reweighting is needed.
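The steps listed above can be sketched in a few lines of pandas. This is a minimal illustration rather than a prescribed workflow: the toy dataset and its column names (age, spend, visits, converted) are invented for the example, and a real project would load client data instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names and values are illustrative only.
df = pd.DataFrame({
    "age": [34, 45, 29, np.nan, 52, 41, 38, 27],
    "spend": [120.0, 80.5, 300.0, 45.0, np.nan, 2500.0, 95.0, 60.0],
    "visits": [3, 2, 7, 1, 4, 9, 2, 1],
    "converted": [0, 0, 1, 0, 0, 1, 0, 0],
})

numeric = df[["age", "spend", "visits"]]

# Summary statistics (count, mean, std, min, max) plus selected percentiles.
summary = numeric.describe(percentiles=[0.05, 0.5, 0.95])

# Skewness as a quick pointer to long tails and possible outliers.
skew = numeric.skew()

# Pairwise Pearson correlations between the numeric variables.
corr = numeric.corr()

# Share of missing values per column.
missing_share = df.isna().mean()

# Class balance of the classification target.
class_balance = df["converted"].value_counts(normalize=True)

print(summary, skew, corr, missing_share, class_balance, sep="\n\n")
```

Plots (histograms, box plots, scatter plots) would normally accompany these tables; they are left out here to keep the sketch short.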
EDA changes the modelling plan. A variable that appears useful in a data dictionary may turn out to have 40% missing values, making it unsuitable as a feature without imputation. A target variable described as binary may have three values in the actual data, which means either cleaning up the unexpected value or reframing the problem as multiclass. EDA surfaces these issues before time is spent building models that will need to be rebuilt once the data reality becomes clear.
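Both checks in the paragraph above are cheap to automate. The sketch below assumes a hypothetical table with an income feature and a churned target; the 30% missingness threshold is an arbitrary illustration, not a standard.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "income" is often missing, and "churned" was
# described as a binary target in the data dictionary.
df = pd.DataFrame({
    "income": [52000, np.nan, 61000, np.nan, 48000,
               np.nan, 75000, np.nan, 58000, 66000],
    "churned": ["yes", "no", "no", "unknown", "no",
                "yes", "no", "no", "unknown", "no"],
})

MAX_MISSING_SHARE = 0.30  # illustrative threshold, chosen for this example

# Flag columns whose share of missing values exceeds the threshold.
missing_share = df.isna().mean()
needs_attention = missing_share[missing_share > MAX_MISSING_SHARE].index.tolist()

# Compare the target's actual distinct values with the "binary" description.
target_values = sorted(df["churned"].dropna().unique())
is_binary = len(target_values) == 2

print("Columns above missingness threshold:", needs_attention)
print("Distinct target values:", target_values, "| binary:", is_binary)
```

A flagged column is not automatically unusable; the flag only marks where a decision (impute, drop, or ask the client) has to be made.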
Why EDA is the step that determines whether client data is actually usable.
Agencies increasingly work with client-provided data: CRM exports, campaign performance histories, customer behavior logs, purchase records. The quality of this data is rarely as described, and the gap between what a client thinks their data contains and what it actually contains can invalidate an entire project scope. EDA is what closes that gap before it becomes a project failure.
First-party data quality is highly variable. A retail client’s customer transaction database may nominally contain three years of purchase history, but EDA may reveal that two platforms were merged mid-period with inconsistent ID schemes, that purchase values are recorded in three different currencies without a currency column, and that 60% of records from the first year are missing email addresses. None of this is visible from the data dictionary. All of it changes what models can be built and what questions can be answered.
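Two of the problems described above, a mid-period change of ID scheme and missing email addresses concentrated in one year, can be surfaced with simple profiling. The sketch below uses an invented six-row table with hypothetical column names (customer_id, order_date, email); the undeclared-currency problem is harder to detect automatically and is not covered here.

```python
import pandas as pd

# Hypothetical transaction extract; IDs change format partway through.
df = pd.DataFrame({
    "customer_id": ["10023", "10024", "10077",
                    "C-000481", "C-000482", "C-000519"],
    "order_date": pd.to_datetime([
        "2021-03-14", "2021-07-02", "2021-11-20",
        "2022-02-05", "2022-06-18", "2023-01-09",
    ]),
    "email": [None, None, "a@example.com",
              "b@example.com", "c@example.com", "d@example.com"],
})

# Reduce each ID to its "shape": every digit becomes 9, every letter A.
# More than one shape suggests more than one ID scheme.
id_shape = (
    df["customer_id"]
    .str.replace(r"\d", "9", regex=True)
    .str.replace(r"[A-Za-z]", "A", regex=True)
)
shape_counts = id_shape.value_counts()

year = df["order_date"].dt.year

# Share of records without an email address, per year.
missing_email_by_year = df["email"].isna().groupby(year).mean()

# Cross-tabulating ID shape against year shows when the scheme changed.
shape_by_year = pd.crosstab(year, id_shape)

print(shape_counts, missing_email_by_year, shape_by_year, sep="\n\n")
```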
EDA informs the data brief before the project brief is locked. Agencies that run a lightweight EDA on client data before scoping a modelling project can produce far more accurate timelines and deliverables. A scope written without EDA often contains optimistic assumptions about data quality that produce either scope creep or client disappointment when the reality differs.
EDA reveals audience segment viability before modelling. When an agency proposes to build a lookalike audience model or a churn predictor, EDA on the seed audience data often reveals whether there is sufficient signal. If the high-value customer segment contains only 200 examples across three years, a model trained on it is unlikely to generalize reliably. EDA surfaces this before the modelling work begins and the budget is spent.
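A viability check of this kind can be as simple as counting the positive examples and comparing them with the number of features the model is expected to use. The sketch below assumes a hypothetical customer table with a high_value flag and an invented count of 40 candidate features. A commonly quoted rule of thumb for regression-style models is roughly ten positive examples per candidate feature; it is used here only as a rough heuristic, not a guarantee.

```python
import pandas as pd

# Hypothetical customer table: 10,000 customers, 200 flagged as high value.
df = pd.DataFrame({
    "customer_id": range(10_000),
    "high_value": [1] * 200 + [0] * 9_800,
})

n_positive = int(df["high_value"].sum())
positive_rate = n_positive / len(df)

# Assumed number of features the proposed model would use (illustrative).
n_candidate_features = 40
positives_per_feature = n_positive / n_candidate_features

# Rough heuristic threshold; treat the result as a prompt for discussion.
MIN_POSITIVES_PER_FEATURE = 10
looks_viable = positives_per_feature >= MIN_POSITIVES_PER_FEATURE

print(f"{n_positive} positives ({positive_rate:.1%} of records), "
      f"{positives_per_feature:.1f} per candidate feature, "
      f"viable by heuristic: {looks_viable}")
```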
What EDA looks like inside a working ad agency.
An agency receives a CRM export from a financial services client and is asked to build a model predicting which customers are likely to upgrade their account tier within 90 days. Before writing any modelling code, the data team runs an EDA. They find that 22% of records have no email address (making targeting harder), the “account type” field has 14 unique values rather than the three described in the brief (requiring mapping and collapsing), the target event (tier upgrade) occurs in only 1.2% of records (severe class imbalance requiring resampling or reweighting), and three variables flagged as strong predictors in the brief take the same value in 95% of records (leaving them with little predictive value). The EDA takes four hours. Without it, the team would have spent two weeks building a model on assumptions that would have required a complete restart when the data issues surfaced during feature engineering.
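Three of the findings in this scenario (unexpected category values, a rare target, and near-constant features) correspond to short checks. The sketch below runs them on an invented 1,000-row table; the column names and values are hypothetical, and the toy data has five raw account types rather than the 14 in the scenario.

```python
import pandas as pd

# Hypothetical CRM extract, built to mimic the issues described above.
df = pd.DataFrame({
    "account_type": ["basic", "Basic", "BASIC ", "premium", "gold"] * 200,
    "has_app": [1] * 960 + [0] * 40,              # one value covers 96% of rows
    "region": ["north", "south", "east", "west"] * 250,
    "upgraded": [1] * 12 + [0] * 988,             # 1.2% positive rate
})

# 1. Category cardinality before and after basic normalisation.
raw_types = df["account_type"].nunique()
normalised_types = df["account_type"].str.strip().str.lower().nunique()

# 2. Positive rate of the target, plus the weight ratio that one
#    reweighting approach might give the positive class.
positive_rate = df["upgraded"].mean()
pos_weight = (1 - positive_rate) / positive_rate

# 3. Near-constant features: share of rows taken by the most common value.
def dominant_share(column: pd.Series) -> float:
    return column.value_counts(normalize=True, dropna=False).iloc[0]

features = ["account_type", "has_app", "region"]
dominant = df[features].apply(dominant_share)
near_constant = dominant[dominant >= 0.95].index.tolist()

print(f"account_type values: {raw_types} raw, {normalised_types} normalised")
print(f"positive rate: {positive_rate:.1%}, positive class weight: {pos_weight:.1f}")
print("near-constant features:", near_constant)
```

In practice the mapping from raw to collapsed account types usually needs input from the client, since case and whitespace normalisation only catches the simplest duplicates.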
Learn how to scope and evaluate AI projects based on what the data actually supports through The Creative Cadence Workshop.
The data strategy module covers how to assess client data quality, identify what questions the data can and cannot answer, and build project scopes that survive contact with real data.
