The process of converting raw data such as text, images, or categorical variables into numerical vectors that machine learning models can process. Vectorization is the essential preprocessing bridge between the real-world inputs that agencies work with, including ad copy, product descriptions, customer reviews, and behavioral logs, and the numerical representations that AI models require. The choice of vectorization method determines what information is preserved and what is lost when converting raw data to numbers.
Also known as text vectorization, feature vectorization, encoding
Vectorization converts non-numerical data into numerical vectors by applying a defined mapping from the original data space to a vector space. For text, the simplest vectorization is bag of words: count how many times each word in the vocabulary appears in the document, and represent the document as a vector of word counts. TF-IDF vectorization multiplies each word count by the word’s inverse document frequency, usually a logarithm of the total number of documents divided by the number of documents containing the word, which down-weights common words that carry little discriminative information. These sparse count-based representations treat text as an unordered bag of words and cannot capture word order, grammar, or semantics beyond word co-occurrence.
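The two sparse methods described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production vectorizer: the whitespace tokenizer, the three sample documents, and the unsmoothed log(N / df) weighting are assumptions chosen for readability, and real libraries typically use smoothed IDF variants and normalization.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for this sketch.
    return text.lower().split()

def build_vocabulary(docs):
    # One vector dimension per distinct token, in a fixed (sorted) order.
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def bag_of_words(doc, vocab):
    # Count how often each vocabulary term appears in the document.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]

def tfidf(docs, vocab):
    # Weight each count by log(N / document frequency), one common IDF variant.
    n_docs = len(docs)
    token_sets = [set(tokenize(d)) for d in docs]
    idf = {
        term: math.log(n_docs / sum(term in toks for toks in token_sets))
        for term in vocab
    }
    return [
        [count * idf[term] for count, term in zip(bag_of_words(d, vocab), vocab)]
        for d in docs
    ]

# Hypothetical sample documents, for illustration only.
docs = ["book the beach trip", "book the city trip", "the beach is sunny"]
vocab = build_vocabulary(docs)
bow_vectors = [bag_of_words(d, vocab) for d in docs]
tfidf_vectors = tfidf(docs, vocab)
```

In this sketch the word "the" appears in every document, so its weight is log(3/3) = 0 and it drops out of the TF-IDF vectors, while "city", which appears in only one document, keeps the largest weight.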
Dense embedding vectorization replaces sparse count vectors with compact learned vectors that capture semantic relationships. Word2Vec and FastText learn dense word vectors where semantically related words have similar vectors. Sentence transformers and BERT-based encoders produce sentence-level dense vectors where sentences with similar meaning have similar vectors regardless of whether they share any words. These dense vectorizations are substantially more powerful than bag-of-words for tasks requiring semantic understanding, at the cost of requiring a pretrained model to compute the embeddings rather than simply counting word occurrences.
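The difference between the two representations shows up when comparing paraphrases with cosine similarity. In the sketch below, the dense vectors are hand-written, hypothetical values standing in for the output of a pretrained encoder; they are not produced by any real model, and the three-dimensional size is chosen only to keep the example readable.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Bag-of-words vectors over the vocabulary
# ["cheap", "flights", "low", "cost", "airfare"].
bow_a = [1, 1, 0, 0, 0]  # "cheap flights"
bow_b = [0, 0, 1, 1, 1]  # "low cost airfare"

# Hypothetical dense vectors (made up for illustration, not model output).
dense_a = [0.9, 0.1, 0.3]
dense_b = [0.8, 0.2, 0.4]

bow_similarity = cosine_similarity(bow_a, bow_b)
dense_similarity = cosine_similarity(dense_a, dense_b)
```

Because the two phrases share no words, their bag-of-words similarity is exactly zero, whereas vectors that place them close together in a learned space yield a similarity near one.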
Categorical feature vectorization converts discrete category labels into numbers. One-hot encoding creates a binary indicator vector with one position per category, with a 1 in the position corresponding to the present category and 0s elsewhere. This is a lossless representation of categorical information but becomes impractically high-dimensional for features with thousands of distinct values. Target encoding replaces each category value with the mean of the target variable for examples in that category, producing a single number per category that captures the category’s relationship to the outcome. Embedding layers in neural networks learn low-dimensional dense representations of categorical features from data, discovering which categories are similar in terms of their relationship to the prediction target.
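One-hot and target encoding can be sketched as follows. The channel labels and conversion outcomes are hypothetical sample data, and the target encoder here is the plain per-category mean described above, with no smoothing or cross-validation safeguards.

```python
from collections import defaultdict

def one_hot(value, categories):
    # Binary indicator vector: 1 at the position of the matching category.
    return [1 if value == category else 0 for category in categories]

def fit_target_encoding(values, targets):
    # Map each category to the mean target value observed for that category.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, target in zip(values, targets):
        sums[value] += target
        counts[value] += 1
    return {value: sums[value] / counts[value] for value in sums}

# Hypothetical training rows: traffic channel and whether the visit converted.
channels = ["search", "social", "email", "search", "social", "search"]
converted = [1, 0, 1, 0, 0, 1]

categories = sorted(set(channels))           # ["email", "search", "social"]
one_hot_search = one_hot("search", categories)
encoding = fit_target_encoding(channels, converted)
encoded_channels = [encoding[c] for c in channels]
```

The one-hot vector grows by one column for every distinct category, while the target-encoded feature stays a single column regardless of how many categories exist.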
An agency training models on text data such as ad copy, customer reviews, support tickets, and survey open-ends must choose a vectorization method before any modeling can occur. That choice determines what signal the model can learn: bag-of-words vectorization preserves keyword presence but discards word order and semantic relationships; embedding-based vectorization preserves semantic similarity but requires a pretrained encoder. Getting vectorization right is a prerequisite for building models that capture the relevant patterns in marketing text data. Getting it wrong produces models that fail not because of algorithmic limitations but because the input representation does not contain the information needed to distinguish the target outcomes.
Sentence embedding vectorization of ad copy enables semantic similarity matching that keyword-based approaches miss. A model that predicts ad copy performance from TF-IDF bag-of-words vectors will learn which specific words co-occur with high performance, but will fail to generalize to new copy that expresses the same message with different words. A model trained on sentence embedding vectors will learn the semantic characteristics that predict performance, generalizing to new copy that is semantically similar to historical high performers even when no specific words overlap. For creative performance prediction at scale, where new copy may use novel phrasings and vocabulary, embedding-based vectorization produces more generalizable models than keyword-based vectorization.
Categorical feature vectorization of high-cardinality fields such as publisher domain and keyword text requires target encoding or embedding layers rather than one-hot encoding. A conversion prediction model that includes publisher domain as a feature faces a categorical feature with potentially tens of thousands of unique values. One-hot encoding this feature would add tens of thousands of binary columns, most of which have extremely few training examples and produce unreliable estimates. Target encoding replaces each domain with the historical conversion rate for that domain, reducing the tens of thousands of columns to a single numeric feature while retaining the relevant information about domain-level conversion rate variation. Alternatively, an embedding layer learns a dense 16- or 32-dimensional representation of each domain, capturing which domains behave similarly without requiring the user to specify the similarity a priori.
Consistent vectorization pipelines that apply the same transformation to training and production data are required to prevent deployment drift. A vectorization pipeline fit on historical data, such as a TF-IDF vectorizer fit on historical ad copy or a categorical encoder fit on historical domain values, will encounter previously unseen terms and categories in production data. An out-of-vocabulary term in TF-IDF vectorization has no dimension of its own and is ignored, so it contributes nothing to the vector; an unseen domain in one-hot encoding is typically mapped to the all-zeros vector. These zero or partially empty representations of novel inputs can cause models to produce incorrect predictions for inputs that fall outside the training vocabulary. Monitoring the out-of-vocabulary rate in production inputs and periodically refitting the vectorizer on an updated vocabulary are the maintenance practices that keep vectorization pipelines aligned between training and deployment.
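Out-of-vocabulary monitoring can be as simple as counting the share of production tokens that the fitted vocabulary has never seen. The sketch below assumes whitespace tokenization, hypothetical sample text, and an arbitrary 20% refit threshold; an appropriate threshold depends on the model and the data.

```python
def fit_vocabulary(training_docs):
    # The set of tokens the vectorizer saw when it was fit.
    return {tok for doc in training_docs for tok in doc.lower().split()}

def oov_rate(docs, vocabulary):
    # Fraction of tokens in the given documents that are not in the vocabulary.
    tokens = [tok for doc in docs for tok in doc.lower().split()]
    if not tokens:
        return 0.0
    unseen = sum(tok not in vocabulary for tok in tokens)
    return unseen / len(tokens)

def needs_refit(rate, threshold=0.2):
    # Assumption: 0.2 is an illustrative threshold, not a recommended value.
    return rate > threshold

# Hypothetical historical and production ad copy.
training_docs = ["book the beach trip", "book the city trip"]
production_docs = ["book the glamping trip", "beach staycation deals"]

vocabulary = fit_vocabulary(training_docs)
rate = oov_rate(production_docs, vocabulary)
refit = needs_refit(rate)
```

Here three of the seven production tokens ("glamping", "staycation", "deals") are unseen, so the rate exceeds the illustrative threshold and the check flags the vectorizer for refitting.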
An agency builds a creative quality classifier for a travel client that must automatically screen AI-generated destination copy for three quality criteria: destination accuracy (copy accurately describes the destination rather than hallucinating false attributes), brand voice match (copy matches the adventurous, aspirational brand tone), and regulatory compliance (copy avoids prohibited superlatives and guarantee language). Three separate binary classifiers are trained, one per criterion, each requiring a vectorization strategy appropriate to its classification task.
For destination accuracy, the agency uses a TF-IDF vectorizer trained on 8,000 approved destination description examples, because accuracy classification depends on the presence or absence of specific verifiable terms (city names, landmark names, correct climate descriptors) that sparse keyword-count representations capture well. Validation F1 for accuracy classification is 0.84 with TF-IDF.
For brand voice matching, a sentence transformer embedding is used because brand voice is a semantic property that cannot be captured by keyword frequencies alone; different sentences can express the adventurous tone with entirely different vocabulary. Validation F1 for voice matching is 0.79 with sentence embeddings versus 0.61 with TF-IDF, confirming that semantic vectorization is necessary for this task.
For compliance classification, the agency again uses TF-IDF because compliance violations depend on specific prohibited phrases (guaranteed, best in the world, perfect) that keyword-presence vectors detect reliably; sentence embeddings produce lower compliance classification accuracy (0.71) than TF-IDF (0.88) because semantic representations can obscure the exact phrase boundaries that trigger compliance failures.
The three-classifier system uses the appropriate vectorization for each task rather than a single uniform approach, producing a combined pre-screening system that achieves better performance than any single vectorization method could achieve across all three quality dimensions simultaneously.
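The keyword-presence idea behind the compliance classifier can be illustrated with a small sketch. This is a simplified stand-in for the TF-IDF features described in the example, not the agency's actual pipeline: it builds a binary vector marking which of the three prohibited phrases named above occur as exact word sequences, and the sample sentences are hypothetical.

```python
import re

# Prohibited phrases taken from the example above.
PROHIBITED_PHRASES = ["guaranteed", "best in the world", "perfect"]

def phrase_presence_vector(text, phrases=PROHIBITED_PHRASES):
    # Normalize to lowercase words separated by single spaces, dropping punctuation.
    normalized = " ".join(re.findall(r"[a-z0-9']+", text.lower()))
    padded = f" {normalized} "
    # 1 if the phrase occurs as a whole-word sequence, else 0.
    return [1 if f" {phrase} " in padded else 0 for phrase in phrases]

# Hypothetical copy samples for illustration.
flagged = phrase_presence_vector("A perfect escape, the best in the world!")
partial = phrase_presence_vector("Guaranteed sunshine on the best beaches in the world")
clean = phrase_presence_vector("Perfectly quiet coves await adventurous travelers")
```

The second sample contains "guaranteed" but not the exact sequence "best in the world", so only the first position is set, which is the kind of exact phrase boundary that a keyword-based representation preserves.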
The generative AI foundations module covers vectorization methods including bag-of-words, TF-IDF, word and sentence embeddings, and categorical encoding strategies, and how vectorization choice determines what information is available to machine learning models trained on marketing text and behavioral data.