AI Glossary · Letter T

Token.

The basic unit of text that a language model processes: a subword segment, whole word, or character that the model’s tokenizer splits text into before the text is encoded as numerical IDs and passed through the model. Token counts determine how much text fits in a model’s context window and how much an API call costs under per-token pricing, and the way a tokenizer splits text affects how the model handles rare words and words with complex morphology.

Also known as subword token, BPE token, language model token

What it is

A working definition of tokens.

Tokenization converts raw text into a sequence of integer IDs that a language model processes. Each ID corresponds to a token: a unit from the model’s vocabulary, which typically contains tens of thousands to a few hundred thousand entries (about 32,000 for Llama 2, about 100,000 for OpenAI’s cl100k_base encoding, and more in some newer models). Most modern language models use subword tokenization algorithms such as Byte Pair Encoding (BPE) or the Unigram language model (often implemented with the SentencePiece library). These algorithms represent common words as single tokens and rare or complex words as several tokens, and they handle text outside the vocabulary by splitting it into smaller subword units, down to individual characters or bytes if necessary. The word “advertising” is often a single token; the word “multicollinearity” is likely to be split into several tokens; a rare technical acronym may be split into individual characters.
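To make the splitting behavior concrete, here is a minimal toy sketch of BPE training and encoding. It is an illustration under simplifying assumptions, not any production tokenizer: it works on the characters of isolated words (real byte-level BPE tokenizers operate on bytes and account for spaces), and the training corpus and merge count are made up. The function names `apply_merge`, `train_bpe`, and `encode` are hypothetical.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn an ordered list of merges by repeatedly joining the most frequent adjacent pair."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Toy corpus: one frequent word and one rare word (made-up data for illustration).
corpus = ["advertising"] * 10 + ["multicollinearity"]
merges = train_bpe(corpus, num_merges=10)

print(encode("advertising", merges))        # ['advertising']: frequent word, one token
print(encode("multicollinearity", merges))  # rare word: still split into many pieces
print(encode("xyz", merges))                # ['x', 'y', 'z']: unseen characters stay split
```

Because merges are learned from pair frequency, the frequent word collapses into one token while the rare word stays fragmented, which is the behavior described above at toy scale.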

Token count, not word count or character count, is the measure of text length that matters for language models. The ratio of tokens to words varies by text type: standard English prose typically runs 1.2 to 1.4 tokens per word; code and structured data produce higher ratios because symbols, whitespace, and long identifiers are split into many tokens; non-English text and text with unusual punctuation may also produce higher ratios, depending on the tokenizer’s training vocabulary. A 1,000-word English document is therefore typically 1,200 to 1,400 tokens. Costs for API calls priced per token (input tokens, output tokens, or both) should be estimated with the tokenizer’s own count rather than a word count, to avoid budget surprises when content is more token-dense than expected.
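As a rough pre-flight check before the real tokenizer is in the loop, the prose ratio above can be turned into a token range. This is a hypothetical helper that only applies the 1.2 to 1.4 tokens-per-word rule of thumb; for billing-grade numbers, count with the model’s own tokenizer (for OpenAI models, the `tiktoken` library does this).

```python
# Rule-of-thumb ratios for English prose, taken from the text; they are
# assumptions, not tokenizer output. Code, structured data, and non-English
# text usually need a higher ratio.
PROSE_LOW_RATIO = 1.2
PROSE_HIGH_RATIO = 1.4


def estimate_token_range(word_count, low_ratio=PROSE_LOW_RATIO, high_ratio=PROSE_HIGH_RATIO):
    """Return a (low, high) token estimate for a document of `word_count` words."""
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    return round(word_count * low_ratio), round(word_count * high_ratio)


print(estimate_token_range(1_000))   # (1200, 1400)
print(estimate_token_range(25_000))  # (30000, 35000)
```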

The context window of a language model, measured in tokens, determines how much text the model can process in a single API call, and in most APIs it must hold both the prompt and the generated output. A context window of 128,000 tokens can accommodate roughly 91,000 to 107,000 words of English text at 1.2 to 1.4 tokens per word. Models also tend to use information near the beginning and end of a long context more reliably than information in the middle (sometimes called the “lost in the middle” effect), which has practical implications for how to structure long prompts. When input text exceeds the context window limit, it must be truncated or split across multiple API calls, which significantly affects tasks that require processing long documents or maintaining long conversation histories.
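Because the window is shared between the prompt and the response, a budget check has to reserve room for the output. A minimal sketch with hypothetical helpers, assuming the same 1.2 to 1.4 tokens-per-word ratio; the output reserve stands for whatever maximum output length the application requests.

```python
def fits_in_context(prompt_tokens, max_output_tokens, context_window):
    """True if the prompt plus the reserved output budget fits in the context window."""
    return prompt_tokens + max_output_tokens <= context_window


def words_that_fit(context_window, low_ratio=1.2, high_ratio=1.4):
    """Approximate (min, max) English words a window holds, ignoring prompt and output."""
    # More tokens per word means fewer words fit, so the high ratio gives the minimum.
    return int(context_window / high_ratio), int(context_window / low_ratio)


print(words_that_fit(128_000))                  # (91428, 106666)
print(fits_in_context(35_000, 2_000, 32_000))   # False: the document alone exceeds the window
print(fits_in_context(35_000, 2_000, 128_000))  # True
```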

Why ad agencies care

Why token understanding is essential for budgeting AI API costs, managing context windows, and debugging language model behavior.

An ad agency that uses language model APIs for content generation, document processing, or conversational AI builds is billed in tokens and constrained by token limits in every API interaction. Understanding how tokenization works, how to estimate token counts for different content types, and how token limits affect what a model can process in a single call is necessary for accurate cost estimation, capacity planning, and debugging truncation, which can silently degrade output quality when prompts exceed context limits.

Token-based API pricing means that prompt engineering choices have direct cost implications: verbose prompts with long preambles cost more than concise prompts that provide the same instruction context. Consider an agency using a language model API priced at $0.003 per 1,000 input tokens that sends a 2,000-token system prompt plus a 500-token user prompt on every call. Over 100,000 API calls per month, the system prompt alone costs $600, out of $750 in total input-token cost. Reducing the system prompt from 2,000 to 500 tokens by removing redundant instructions and condensing examples cuts monthly input-token cost by $450 (from $750 to $300), a 60% reduction. Prompt cost optimization through concise instruction design is standard practice for high-volume API deployments; it is often overlooked during low-volume development but becomes material at production scale.
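The arithmetic in this scenario can be checked with a few lines of code. A minimal sketch using the example rate from the text (an assumed price, not any provider’s current list price); `input_cost` and `monthly_input_cost` are hypothetical helpers, and `Decimal` keeps dollar amounts free of floating-point rounding.

```python
from decimal import Decimal

# Example input price from the text: $0.003 per 1,000 input tokens (an assumption).
INPUT_PRICE_PER_1K = Decimal("0.003")


def input_cost(tokens, price_per_1k=INPUT_PRICE_PER_1K):
    """Dollar cost of `tokens` input tokens at a per-1,000-token price."""
    return Decimal(tokens) * price_per_1k / 1000


def monthly_input_cost(system_tokens, user_tokens, calls_per_month):
    """Input-token cost per month when every call repeats the same system prompt."""
    return input_cost((system_tokens + user_tokens) * calls_per_month)


calls = 100_000
system_only = input_cost(2_000 * calls)
before = monthly_input_cost(2_000, 500, calls)
after = monthly_input_cost(500, 500, calls)
savings = before - after

print(f"system prompt alone: ${system_only:.2f}")                    # $600.00
print(f"input cost: ${before:.2f} -> ${after:.2f}")                   # $750.00 -> $300.00
print(f"savings: ${savings:.2f} ({savings / before * 100:.0f}%)")     # $450.00 (60%)
```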

Token limits on context windows determine whether long-document processing tasks require chunking strategies, which affect both cost and output quality. A 50-page brand guidelines document (approximately 25,000 words, or roughly 30,000 to 35,000 tokens) cannot be processed in a single call by a model with a 32,000-token context window: the document alone fills or exceeds the window, leaving no room for the prompt and output. A model with a 128,000-token context window can take the full document plus a substantial prompt and output in a single call. When chunking is required, the quality of question answering, summarization, and extraction tasks degrades because the model cannot see the full document at once, and maintaining coherence across chunks requires additional prompt engineering to pass context between calls. Model selection based on context window size is therefore a first-order decision for long-document processing use cases.
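A common chunking strategy is a sliding window over the token IDs with a fixed overlap, so that text near a chunk boundary appears in both neighboring chunks. A minimal sketch, assuming the token IDs come from the model’s own tokenizer; `chunk_tokens` is a hypothetical helper, and its defaults mirror the 500-token chunks with 50-token overlap in the pharmaceutical Q&A example in this entry.

```python
def chunk_tokens(token_ids, chunk_size=500, overlap=50):
    """Split a token sequence into chunks of `chunk_size` tokens that overlap by `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new chunk starts after the previous one
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this chunk already reaches the end of the sequence
    return chunks


# Stand-in for 1,200 real token IDs.
chunks = chunk_tokens(list(range(1_200)))
print([len(c) for c in chunks])  # [500, 500, 300]
```

The overlap costs extra tokens (each chunk after the first repeats 50 tokens), which is the usual trade-off between embedding cost and boundary coverage.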

Token count visibility in API responses enables usage monitoring and anomaly detection for AI cost management at agency scale. Most language model APIs return token usage in their response (prompt tokens, completion tokens, total tokens). Logging these usage statistics per API call and per client project enables the agency to track AI usage costs at the project level, identify unexpectedly expensive call patterns (prompts that are inadvertently token-heavy due to content formatting), and allocate AI API costs accurately to client accounts. Building token usage monitoring into the API integration from the start is substantially cheaper than retrofitting usage tracking after discovering that AI costs are opaque in aggregate billing.
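Per-call logging can be as simple as recording one entry per response and summing by project. A minimal sketch, assuming each log record is a dict with a hypothetical `project` key plus `prompt_tokens` and `completion_tokens` counts copied from the API response (these names match the usage fields in OpenAI’s Chat Completions responses; other providers name them differently). The alert threshold is an arbitrary example value.

```python
from collections import defaultdict


def summarize_usage(records, prompt_token_alert=4_000):
    """Sum token usage per project and collect calls whose prompt exceeds the alert threshold."""
    totals = defaultdict(lambda: {"calls": 0, "prompt_tokens": 0, "completion_tokens": 0})
    alerts = []
    for rec in records:
        project = totals[rec["project"]]
        project["calls"] += 1
        project["prompt_tokens"] += rec["prompt_tokens"]
        project["completion_tokens"] += rec["completion_tokens"]
        if rec["prompt_tokens"] > prompt_token_alert:
            alerts.append(rec)  # possibly an inadvertently token-heavy prompt
    return dict(totals), alerts


# Made-up log records for illustration.
log = [
    {"project": "client-a", "prompt_tokens": 2_750, "completion_tokens": 350},
    {"project": "client-a", "prompt_tokens": 9_100, "completion_tokens": 300},
    {"project": "client-b", "prompt_tokens": 600, "completion_tokens": 120},
]
totals, alerts = summarize_usage(log)
print(totals["client-a"])  # {'calls': 2, 'prompt_tokens': 11850, 'completion_tokens': 650}
print(len(alerts))         # 1
```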

In practice

What tokens look like inside a working ad agency.

An agency is building a document Q&A system for a pharmaceutical client that lets employees query a 400-document regulatory compliance library using natural language questions. The documents range from 2 to 85 pages each; the total corpus is approximately 1.8 million words (2.2 to 2.5 million tokens).

The system uses retrieval-augmented generation: each document is split into 500-token chunks with 50-token overlap, embedded using a sentence transformer, and stored in a vector database. At query time, the user’s question is embedded and the top 5 most similar chunks are retrieved. The retrieved chunks (up to 2,500 tokens), the user question (typically 20 to 60 tokens), and the system prompt (180 tokens) are concatenated and sent to the language model for answer generation.

The agency’s token budget analysis: average input tokens per query = 2,750 (system prompt 180 + retrieved chunks 2,500 + question 40 + context headers 30). Average output tokens per query = 350 (answer plus citations). Total tokens per query: 3,100. At $0.003 per 1,000 input tokens and $0.015 per 1,000 output tokens, cost per query = $0.00825 + $0.00525 = $0.01350. At a projected volume of 2,000 queries per month, API costs come to $27 per month. Document indexing cost (one-time): about 2.4 million embedding tokens at $0.0001 per 1,000 tokens = $0.24 (slightly more once the 50-token chunk overlaps are counted).

The agency identifies one optimization: the 180-token system prompt includes a 120-token preamble explaining the Q&A format that is redundant given the retrieval context. Cutting the system prompt to 60 tokens saves 120 input tokens ($0.00036) per query, reducing cost per query to $0.01314, a 2.7% reduction. At 2,000 queries per month, this saves about $0.72 per month: a minor optimization at this volume, but worth implementing as query volume grows. The agency documents the token budget for the system in the handoff materials, enabling the client to forecast API costs as the system scales to additional user groups.
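Keeping the budget in a small script makes it easy to rerun when prices or volumes change. A minimal sketch using the example rates from this scenario (assumed prices, not a specific provider’s list); `query_cost` is a hypothetical helper, and `Decimal` avoids floating-point rounding on small per-query amounts.

```python
from decimal import Decimal

# Example rates from the scenario, in dollars per 1,000 tokens (assumptions).
INPUT_PRICE_PER_1K = Decimal("0.003")
OUTPUT_PRICE_PER_1K = Decimal("0.015")


def query_cost(input_tokens, output_tokens):
    """Dollar cost of one call with the given input and output token counts."""
    return (Decimal(input_tokens) * INPUT_PRICE_PER_1K
            + Decimal(output_tokens) * OUTPUT_PRICE_PER_1K) / 1000


# Average per-query token budget from the scenario.
input_tokens = 180 + 2_500 + 40 + 30  # system prompt + retrieved chunks + question + headers
output_tokens = 350
queries_per_month = 2_000

per_query = query_cost(input_tokens, output_tokens)
trimmed = query_cost(input_tokens - 120, output_tokens)  # system prompt cut from 180 to 60 tokens
monthly_savings = (per_query - trimmed) * queries_per_month

print(f"per query: ${per_query:.5f}")                      # $0.01350
print(f"per month: ${per_query * queries_per_month:.2f}")  # $27.00
print(f"trimmed per query: ${trimmed:.5f}")                # $0.01314
print(f"monthly savings: ${monthly_savings:.2f}")          # $0.72
```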

Build the language model API literacy that enables accurate cost estimation, context management, and production AI system design through The Creative Cadence Workshop.

The generative AI foundations module covers tokenization, context windows, token-based pricing, and the chunking and prompt optimization strategies that determine the cost and quality of language model API integrations in agency AI systems.