The process of breaking text into the small units (tokens) that AI language models actually read. For ad agencies, tokenization is the hidden unit of cost, context, and quality behind every AI tool the team uses, even when no one on the team is thinking about it.
Also known as token, tokenisation, subword segmentation, text encoding
Tokenization is the step where natural-language text is converted into a sequence of numerical IDs that an AI model can process. A token is typically a chunk somewhere between a single character and a whole word: often a frequent word, a word fragment, or a common prefix. The sentence you are reading is, to a large language model, a sequence of token IDs, not a string of letters.
Token counts matter for three reasons: providers bill by the token, every model has a maximum context window measured in tokens, and a poorly tokenized brand name or campaign tag can quietly degrade how a model handles it. On average, one English token equals roughly four characters or about three-quarters of a word, but the average hides a lot of variation, especially in non-English text or specialized vocabulary.
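The rule of thumb above can be turned into a quick back-of-envelope estimator. This is a minimal sketch, assuming English text and the averages quoted in this entry; the function names are illustrative, and only a model's own tokenizer gives the true count.

```python
import math

# Averages quoted above. They are rough and apply to English only.
CHARS_PER_TOKEN = 4.0
WORDS_PER_TOKEN = 0.75


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate tokens as character count divided by four, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Estimate tokens as word count divided by 0.75, rounded up."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


if __name__ == "__main__":
    sample = "Tokenization is invisible until it bites."
    print("by characters:", estimate_tokens_from_chars(sample))
    print("by words:", estimate_tokens_from_words(sample))
```

The two estimates will usually disagree slightly. Treat the gap as a reminder that these are planning numbers, not billing numbers.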
Most agency teams treat tokens as something the platform handles. That’s mostly fine, until a project hits a context limit, a budget surprises someone, or a brand name gets mangled in output. There are three places where tokenization quietly shapes agency work:
Cost predictability. Most API-based AI tools are priced per token. A team running heavy batch jobs (generating thousands of social variations, summarizing hundreds of transcripts) can rack up costs that look small per call but compound fast. Knowing how to count tokens turns surprise invoices into budget lines.
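To see how small per-call costs compound, here is a rough cost sketch. The prices, call volume, and token counts are all hypothetical placeholders, not any vendor's real rates, and the split between input and output pricing is an assumption that should be checked against the provider's current price sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in dollars per one million tokens."""
    input_per_million: float
    output_per_million: float


def job_cost(calls: int, input_tokens_per_call: int,
             output_tokens_per_call: int, pricing: Pricing) -> float:
    """Return the estimated dollar cost of a batch of identical calls."""
    input_cost = calls * input_tokens_per_call * pricing.input_per_million / 1_000_000
    output_cost = calls * output_tokens_per_call * pricing.output_per_million / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Placeholder rates chosen only to make the arithmetic easy to follow.
    example = Pricing(input_per_million=3.00, output_per_million=15.00)
    print(f"one call:    ${job_cost(1, 1_500, 300, example):.4f}")
    print(f"5,000 calls: ${job_cost(5_000, 1_500, 300, example):.2f}")
```

Under these placeholder numbers a single call costs under a cent, while the full batch is a line item worth budgeting for.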
Context window limits. Pasting a long brief plus reference materials plus instructions into a single prompt can exceed a model’s context limit, at which point the tool either rejects the request or silently drops part of the content. Understanding tokens means understanding why a prompt suddenly stops behaving.
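A pre-flight check along these lines can flag an oversized prompt before it is sent. This is a sketch only: the context window size and the response reserve are hypothetical values, and the token count uses the rough four-characters-per-token average rather than a real tokenizer.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token for English text."""
    return math.ceil(len(text) / 4)


def check_fit(parts: dict[str, str], context_window: int,
              reserved_for_response: int) -> tuple[bool, int, int]:
    """Return (fits, prompt_tokens, prompt_budget) for a multi-part prompt.

    The budget is the context window minus the room set aside for the reply.
    """
    prompt_tokens = sum(estimate_tokens(text) for text in parts.values())
    prompt_budget = context_window - reserved_for_response
    return prompt_tokens <= prompt_budget, prompt_tokens, prompt_budget


if __name__ == "__main__":
    # Stand-in texts; in practice these would be the real brief and references.
    prompt_parts = {
        "brief": "x" * 20_000,
        "references": "y" * 12_000,
        "instructions": "z" * 400,
    }
    # Hypothetical limits. Look up the actual figures for the model in use.
    fits, used, budget = check_fit(prompt_parts, context_window=8_000,
                                   reserved_for_response=1_000)
    print(f"estimated prompt tokens: {used}, budget: {budget}, fits: {fits}")
```

Reserving space for the response matters because the reply is counted against the same window as the prompt.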
Brand name handling. Tokenizers can split an unusual brand name into several pieces that the model has rarely seen together, which can produce output where the name appears mangled or inconsistent. For campaigns built around a distinctive name, this matters more than the team might realize.
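The splitting effect can be illustrated with a toy segmenter. This is a deliberately simplified sketch: it uses greedy longest-match against a tiny made-up vocabulary, which is not how production tokenizers are built, and both brand names are invented for the example.

```python
def segment(word: str, vocab: set[str]) -> list[str]:
    """Split a word into the longest known pieces, left to right.

    Any character that no vocabulary entry covers becomes its own piece.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched, so fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical vocabulary. Real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"bright", "wave", "zy", "qu", "or"}

if __name__ == "__main__":
    for name in ("brightwave", "zyquora"):  # invented brand names
        pieces = segment(name, TOY_VOCAB)
        print(f"{name}: {len(pieces)} pieces {pieces}")
```

A name built from familiar word parts stays in a couple of pieces, while an unusual coinage fragments. For a real check, paste the name into the tokenizer tool published for the model being used.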
A senior strategist building a long-form research prompt checks the token count before running it, pasting the prompt into a tokenizer tool to see whether it fits the model’s context window with room for the response. A producer running a batch-summarization job uses a cheaper, faster model for first passes and a frontier model only for outputs that need to ship. A copywriter testing a new brand name in a campaign first runs it through a tokenizer to confirm it doesn’t split awkwardly. Each step takes a minute. Each step prevents a small but expensive mistake.
Tokenization is invisible until it bites. The discipline is making it visible before it does.
The generative AI foundations module of the workshop covers how tokenization shapes cost, context limits, and output quality, and how to make those forces work for your projects instead of catching them by surprise.