AI Glossary · Letter H

Hashing.

The process of converting input data of arbitrary size into a fixed-size output value through a deterministic mathematical function. In AI and marketing technology, hashing appears in three important contexts: privacy-preserving data matching that enables cross-platform audience activation, feature engineering for machine learning that handles high-cardinality categorical variables, and data integrity verification that ensures datasets have not been corrupted or tampered with.

Also known as: hash function, cryptographic hash, feature hashing

What it is

A working definition of hashing.

A hash function takes any input (a string, a file, a data record) and produces a fixed-length output called a hash or digest. The same input always produces the same hash, and with a well-designed hash function even a tiny change to the input produces a completely different hash. Cryptographic hash functions are also one-way: given only a hash value, it is computationally infeasible to recover the original input, although anyone who already holds a candidate input can hash it and check whether it matches. This combination of determinism and one-way transformation makes hashing useful in contexts where two parties need to verify that they are referring to the same piece of data without sharing the data itself.
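The properties above can be checked in a few lines. This is a minimal sketch that assumes SHA-256 from Python's standard hashlib module as the hash function; the email addresses and the helper name sha256_hex are illustrative only.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = sha256_hex("jane@example.com")
again = sha256_hex("jane@example.com")
changed = sha256_hex("jane@example.con")   # last character differs
large = sha256_hex("x" * 100_000)          # much longer input

print(first == again)          # determinism: same input, same digest
print(len(first), len(large))  # fixed size: both digests are 64 hex characters
print(first == changed)        # a one-character change gives a different digest
```

The digest length depends only on the hash function chosen, not on the size of the input.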

In data clean rooms and privacy-preserving audience matching, hashing is used to create pseudonymous identifiers that enable matching without exposing raw personal data. An advertiser and a publisher can each hash their customer email lists using the same hash function. When the two parties compare their hashed lists, they can identify which hashed values appear in both, indicating overlapping customers, without either party exposing the underlying email addresses to the other. The share of records that match is the hashed match rate, and this hashed match is the basis for audience activation in walled-garden environments where direct data sharing would violate privacy requirements or platform policies.
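The matching step can be sketched as a set intersection over hashed values. This is a toy illustration, not a clean room implementation: the email lists are invented, SHA-256 is assumed as the shared hash function, and the match rate is computed against the advertiser's list, which is one common convention but an assumption here.

```python
import hashlib


def hash_email(email: str) -> str:
    """Hash an email address with SHA-256 (assumed to be the agreed function)."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


# Hypothetical lists; each party hashes its own data before anything is compared.
advertiser_emails = ["ana@example.com", "ben@example.com",
                     "cal@example.com", "dee@example.com"]
publisher_emails = ["ben@example.com", "dee@example.com", "eli@example.com"]

advertiser_hashes = {hash_email(e) for e in advertiser_emails}
publisher_hashes = {hash_email(e) for e in publisher_emails}

# Only hashed values are compared; the overlap is the matched audience.
overlap = advertiser_hashes & publisher_hashes
match_rate = len(overlap) / len(advertiser_hashes)

print(len(overlap), match_rate)  # 2 0.5
```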

Feature hashing, also called the hashing trick, is a machine learning technique for representing high-cardinality categorical variables in a fixed-size feature vector. Instead of creating one binary feature for each possible category value, which becomes unmanageable when there are millions of possible values such as URLs, product IDs, or user agents, feature hashing maps each category value to a position in a fixed-size vector using a hash function. This enables efficient representation of categorical variables with arbitrary cardinality at the cost of occasional hash collisions, where two different values are mapped to the same position. In practice, the collision rate is low enough to have negligible impact on model performance when the hash space is sufficiently large.
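A minimal sketch of the hashing trick, assuming a plain count vector and SHA-256 as a stable hash. The function names, the bucket count of 16, and the domains are illustrative. Python's built-in hash() is avoided because its output for strings is randomized per process by default, which would make the bucket positions unstable between runs.

```python
import hashlib


def bucket_index(value: str, n_buckets: int) -> int:
    """Map a category value to a stable position in [0, n_buckets)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_features(values, n_buckets: int = 16):
    """Count how many of the given category values land in each bucket."""
    vector = [0] * n_buckets
    for value in values:
        vector[bucket_index(value, n_buckets)] += 1
    return vector


# Hypothetical publisher domains seen for one user.
page_views = ["news.example.com", "shop.example.com", "news.example.com"]
vector = hash_features(page_views, n_buckets=16)

print(len(vector), sum(vector))  # 16 3
```

The vector length stays at 16 no matter how many distinct domains exist, and no lookup table from domain to index is stored. Two different domains can share a bucket, which is the collision cost described above; a larger n_buckets makes that less likely. Production libraries such as scikit-learn's FeatureHasher use a signed variant of the same idea.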

Why ad agencies care

Why hashing might matter more in agency work than in most industries.

Privacy-preserving audience activation, data clean room integrations, and cross-platform identity resolution all rely on hashing as the foundational privacy mechanism. A working ad agency that understands how hashing works in these contexts can evaluate clean room implementations more precisely, troubleshoot hashed match rate problems, and explain to clients why hashing does and does not protect personal data in specific configurations.

Hashed match rates are not perfect identity resolution. When an advertiser hashes their customer email list and matches it against a publisher’s hashed list, the match rate depends on both audience overlap and email address format consistency. The same email address written in a different case, with stray whitespace, or in an alternative format produces a different hash value and fails to match. Before blaming low match rates on audience overlap, agencies should verify that both parties are applying the same normalization steps before hashing: lowercasing, trimming whitespace, and standardizing email format. A 10-15% improvement in match rate from normalization is common when this is not already standardized.
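The effect of normalization can be shown directly. This sketch assumes SHA-256 and covers only lowercasing and whitespace trimming; any further email format rules are specific to each platform and are not modeled here. The function names and sample addresses are hypothetical.

```python
import hashlib


def normalize_email(raw: str) -> str:
    """Trim surrounding whitespace and lowercase the address."""
    return raw.strip().lower()


def hash_raw(raw: str) -> str:
    """Hash the value exactly as exported, with no cleanup."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def hash_normalized(raw: str) -> str:
    """Normalize first, then hash."""
    return hash_raw(normalize_email(raw))


# Three spellings of the same address, as they might appear in a CRM export.
variants = ["Jane.Doe@Example.com", "jane.doe@example.com ", "  JANE.DOE@EXAMPLE.COM"]

raw_hashes = {hash_raw(v) for v in variants}
normalized_hashes = {hash_normalized(v) for v in variants}

print(len(raw_hashes), len(normalized_hashes))  # 3 1
```

Without normalization the same person looks like three different identifiers; with it, all three spellings collapse to one hash that can match the other party's list.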

Hashing does not make data anonymous, only pseudonymous. Because the hash is deterministic, anyone who holds the original email address can hash it and find the match. Hashed identifiers therefore remain personal data under GDPR and CCPA and are subject to the same data handling requirements as the underlying identifiers. Agencies advising clients on clean room data governance should ensure this distinction is understood at the legal and compliance level, not just the technical level.
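A short sketch of why a hashed list is pseudonymous. It assumes the list was built with SHA-256 over lowercased, trimmed addresses; the addresses and variable names are invented for illustration.

```python
import hashlib


def hash_email(email: str) -> str:
    """Normalize and hash an address the same way the shared list was built."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


# A hashed list as one party might receive it: no raw addresses included.
shared_hashes = {hash_email("jane@example.com"), hash_email("omar@example.com")}

# The recipient already knows some addresses from its own records.
known_addresses = ["jane@example.com", "kim@example.com", "omar@example.com"]

# Re-hashing known addresses reveals which of them are in the shared list.
reidentified = [a for a in known_addresses if hash_email(a) in shared_hashes]

print(reidentified)  # ['jane@example.com', 'omar@example.com']
```

Nothing was reversed here. The recipient simply repeated the same deterministic computation on addresses it already held, which is exactly what makes matching possible and why the hashes still identify people.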

Feature hashing in ad tech models handles the scale of programmatic data. Programmatic advertising generates features with enormous cardinality: millions of unique publisher domains, hundreds of millions of unique user agents, and trillions of unique cookie IDs. Feature hashing is how ad tech companies represent these features at scale in machine learning models without maintaining explicit mappings from every observed value to a feature index. Understanding feature hashing helps agencies interpret model feature importance reports for models trained on programmatic data, where high-cardinality hashed features behave differently from low-cardinality structured features.

In practice

What hashing looks like inside a working ad agency.

An agency is setting up a data clean room integration between a retail client’s first-party CRM data and a streaming platform’s subscriber data, to measure the overlap between the client’s loyalty program members and the platform’s subscribers for campaign planning. The initial integration produces a 12% hashed email match rate, which the platform account team says is low for a retail loyalty audience. The agency audits the hashing process and finds that the client’s CRM exports email addresses in mixed case and with inconsistent trailing spaces, while the platform normalizes all emails to lowercase and trims whitespace before hashing. The agency adds a normalization step to the client’s export process: lowercase all email addresses and trim whitespace before hashing. After the normalization fix, the match rate rises to 31%, which is consistent with the platform’s typical match rates for comparable retail loyalty audiences. The additional 19 percentage points of matched audience represent users who were always in both datasets but were failing to match because of formatting inconsistency, not actual non-overlap.
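An audit like this one can be approximated by computing the match rate twice, once on the export as-is and once after normalization. The sketch below uses a five-record toy dataset, so its percentages are not the 12% and 31% from the scenario; SHA-256, the function names, and the addresses are all assumptions.

```python
import hashlib


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def match_rate(client_emails, platform_hashes, normalize: bool) -> float:
    """Share of client records whose hash appears in the platform's hashed list."""
    matched = 0
    for email in client_emails:
        value = email.strip().lower() if normalize else email
        if sha256_hex(value) in platform_hashes:
            matched += 1
    return matched / len(client_emails)


# Hypothetical CRM export with mixed case and stray spaces.
client_export = ["Ana@Example.com", "ben@example.com ", "cal@example.com",
                 " DEE@EXAMPLE.COM", "eli@example.com"]

# The platform is assumed to lowercase and trim before hashing.
platform_emails = ["ana@example.com", "ben@example.com", "cal@example.com",
                   "dee@example.com", "zoe@example.com"]
platform_hashes = {sha256_hex(e) for e in platform_emails}

before = match_rate(client_export, platform_hashes, normalize=False)
after = match_rate(client_export, platform_hashes, normalize=True)

print(f"{before:.0%} -> {after:.0%}")  # 20% -> 80%
```

The gap between the two figures is the portion of the shortfall caused by formatting. Whatever remains unmatched after normalization (here, one address that is not on the platform) reflects genuine non-overlap.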

Build the data infrastructure literacy that makes privacy-preserving audience activation work correctly through The Creative Cadence Workshop.

The automations and agents module covers how to build and audit data pipelines for audience matching and clean room integrations, including the identity resolution and normalization practices that determine match rate quality.