AI Glossary · Letter L

Latency.

The elapsed time between submitting a request to an AI system and receiving its complete response. Latency determines whether an AI capability feels instantaneous or sluggish and is the primary constraint on deploying models in real-time interactive applications where user experience depends on sub-second response times.

Also known as inference latency, response time, round-trip time

What it is

A working definition of latency.

Latency is measured from the moment a request is sent to the moment the response is fully available. For a language model generating a long response, this is the time until the last token is generated; for an image classifier, it is the time until the classification label is returned. Latency has multiple components: network transit time from the client to the server and back, queue wait time if the server is handling concurrent requests, and actual compute time for the model to process the input and generate the output. All three components contribute to the total latency experienced by the end user or downstream system.
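The three components can be written down as a simple sum. The sketch below is illustrative only: the `LatencyBreakdown` class, its field names, and the example timings are assumptions made for this entry, not measurements from a real system.

```python
from dataclasses import dataclass


@dataclass
class LatencyBreakdown:
    """Per-request timings in milliseconds (hypothetical field names)."""
    network_ms: float   # transit from client to server and back
    queue_ms: float     # waiting behind other concurrent requests
    compute_ms: float   # model processing and output generation

    @property
    def total_ms(self) -> float:
        # Total latency is the sum of all three components.
        return self.network_ms + self.queue_ms + self.compute_ms

    def shares(self) -> dict[str, float]:
        """Fraction of total latency attributable to each component."""
        total = self.total_ms
        if total == 0:
            return {"network": 0.0, "queue": 0.0, "compute": 0.0}
        return {
            "network": self.network_ms / total,
            "queue": self.queue_ms / total,
            "compute": self.compute_ms / total,
        }


# Made-up example values, chosen only to show the arithmetic.
example = LatencyBreakdown(network_ms=40.0, queue_ms=10.0, compute_ms=150.0)
print(example.total_ms)  # 200.0
print(example.shares())  # {'network': 0.2, 'queue': 0.05, 'compute': 0.75}
```

Breaking a measurement down this way shows where optimization effort is likely to pay off: in this made-up example, three quarters of the wait is compute, so a faster network would help comparatively little.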

Latency is distinct from throughput, which measures how many requests a system can process per unit of time. A system can have high throughput and high latency simultaneously by processing many requests in large batches, which is efficient for bulk processing but produces slow individual response times. Conversely, a system optimized for low latency may achieve fast individual responses but handle fewer total requests per second. Production AI deployments typically specify both a latency target, usually expressed as a percentile such as the 95th or 99th percentile of response times, and a throughput requirement, and the infrastructure must be sized and configured to meet both simultaneously.
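A short sketch can make both ideas concrete. The `percentile` helper below uses the nearest-rank method, which is one of several common percentile definitions, and `batch_tradeoff` assumes every request in a batch waits for the whole batch to finish. The function names and all sample numbers are assumptions for illustration.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def batch_tradeoff(batch_size: int, batch_time_s: float) -> tuple[float, float]:
    """Return (throughput in requests/s, per-request latency in s).
    Simplifying assumption: each request waits for the full batch."""
    return batch_size / batch_time_s, batch_time_s


# Hypothetical response times in milliseconds: mostly fast, two slow outliers.
latencies_ms = [120, 130, 110, 140, 125, 135, 115, 145, 150, 128,
                132, 118, 122, 138, 142, 126, 134, 116, 900, 2400]

print(percentile(latencies_ms, 50))  # 130
print(percentile(latencies_ms, 95))  # 900
print(percentile(latencies_ms, 99))  # 2400

print(batch_tradeoff(64, 4.0))   # (16.0, 4.0): high throughput, slow responses
print(batch_tradeoff(1, 0.25))   # (4.0, 0.25): lower throughput, fast responses
```

The median in this sample looks healthy at 130 ms, while the 95th percentile reveals the slow tail, which is why latency targets are usually stated as high percentiles rather than averages.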

Time-to-first-token latency is a distinct and often more important metric for streaming language model responses, measuring how long the user waits before seeing any output. For interactive chat and writing assistance applications, a fast time-to-first-token makes the system feel responsive even if the full response takes several seconds to complete, because users can begin reading while generation continues. Streaming APIs that deliver output token-by-token as it is generated have become standard precisely because they improve perceived responsiveness even when total generation time is unchanged.
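Time-to-first-token can be measured by timestamping the first chunk of a stream separately from the last. The sketch below is a minimal illustration: `fake_stream` is a hypothetical stand-in for a real streaming API, and the delay value is arbitrary.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[str, Optional[float], float]:
    """Consume a token stream and return
    (full text, time-to-first-token in s, total time in s).
    Time-to-first-token is None if the stream yields nothing."""
    start = clock()
    first_token_at: Optional[float] = None
    parts: list[str] = []
    for token in tokens:
        if first_token_at is None:
            first_token_at = clock()  # the user sees output from this point
        parts.append(token)
    total_s = clock() - start
    ttft_s = None if first_token_at is None else first_token_at - start
    return "".join(parts), ttft_s, total_s


def fake_stream(words: list[str], delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API: one token per delay."""
    for word in words:
        time.sleep(delay_s)  # simulated per-token generation time
        yield word + " "


text, ttft_s, total_s = measure_stream(
    fake_stream("five short headline words here".split())
)
print(f"first token after {ttft_s:.3f}s, full response after {total_s:.3f}s")
```

With a real streaming client, the same `measure_stream` function could wrap whatever iterator the client returns, assuming it yields text chunks.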

Why ad agencies care

Why latency constraints shape which AI capabilities are deployable in agency workflows.

A working ad agency integrating AI tools into client-facing workflows and production systems faces latency tradeoffs that determine which AI capabilities are viable in which contexts. A creative brief analysis tool that takes 45 seconds to return results is acceptable for workflow planning but unusable in a live brainstorm session. A real-time personalization model that requires sub-100-millisecond inference is viable for programmatic ad serving but may require expensive dedicated GPU infrastructure. Understanding latency constraints helps agencies select the right deployment architecture for each use case rather than discovering feasibility problems after building the integration.

Real-time bidding requires sub-100-millisecond model inference. The typical timeout in programmatic ad auctions is 80 to 120 milliseconds from bid request to bid response. A custom audience scoring or creative selection model that runs during bidding must complete its inference within that window, including network transit time and any preprocessing, which leaves the model itself only a fraction of the total budget. This requirement effectively rules out large language models and complex ensemble methods for direct integration into the bidding stack, and favors lightweight models such as gradient boosting classifiers or small neural networks served from low-latency inference infrastructure co-located with the bidding system. Agencies building custom bidding models need to measure and account for inference latency from the beginning of the project, not as an afterthought.
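A latency budget check makes this constraint explicit. The sketch below is a rough illustration: the function name, the 10 ms safety margin, and every timing in the examples are assumptions, and a real bidding stack would use its own measured figures.

```python
def fits_bid_window(
    network_ms: float,
    preprocess_ms: float,
    inference_ms: float,
    timeout_ms: float = 100.0,
    safety_margin_ms: float = 10.0,
) -> tuple[bool, float]:
    """Return (fits, headroom_ms) for one bid request.
    Headroom is what remains of the auction timeout after the safety
    margin and all three costs are subtracted; negative means too slow."""
    spent_ms = network_ms + preprocess_ms + inference_ms
    headroom_ms = timeout_ms - safety_margin_ms - spent_ms
    return headroom_ms >= 0, headroom_ms


# Hypothetical lightweight model (e.g. a gradient boosting classifier).
print(fits_bid_window(network_ms=20, preprocess_ms=5, inference_ms=8))
# (True, 57.0)

# Hypothetical large model with a much longer inference time.
print(fits_bid_window(network_ms=20, preprocess_ms=5, inference_ms=800))
# (False, -735.0)
```

Running a check like this against measured timings early in a project shows whether a candidate model is even in the right range before any integration work begins.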

Streaming API responses transform the UX of AI-assisted creative tools. For an AI writing assistant integrated into a briefing or copywriting workflow, the difference between receiving a complete response after 8 seconds and seeing words appear progressively within 1 second is the difference between an awkward waiting experience and a fluid collaborative one. Streaming API calls that deliver partial results as they are generated have become the standard pattern for interactive AI writing tools precisely because they make latency perceptually acceptable even when total generation time is substantial.

Batch processing sidesteps latency constraints for non-real-time analysis. Many valuable AI applications do not require real-time responses. Overnight audience segmentation, weekly creative performance analysis, and bulk content classification can run as scheduled batch jobs where total processing time matters more than per-request latency. Routing AI workloads to batch processing where real-time response is not required is often the simplest way to make use of larger, more capable models without paying the infrastructure premium required for low-latency serving.
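The routing decision can be sketched as a simple rule on how long each workload's results can wait. The `Workload` class, the 5-second threshold, and the example deadlines below are all assumptions for illustration; real thresholds depend on the tools and infrastructure involved.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    deadline_s: float  # how long the consumer can wait for results


def route(workload: Workload, realtime_threshold_s: float = 5.0) -> str:
    """Send a workload to low-latency serving only if it needs a fast answer;
    everything else goes to scheduled batch processing."""
    if workload.deadline_s <= realtime_threshold_s:
        return "realtime"
    return "batch"


# Hypothetical workloads with made-up deadlines.
workloads = [
    Workload("live headline suggestions", deadline_s=2.0),
    Workload("overnight audience segmentation", deadline_s=8 * 3600.0),
    Workload("weekly creative performance analysis", deadline_s=24 * 3600.0),
]

for w in workloads:
    print(f"{w.name}: {route(w)}")
```

Only the first of these example workloads would be routed to low-latency serving; the other two can run as scheduled jobs, possibly on a larger model.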

In practice

What latency looks like inside a working ad agency.

An agency is integrating an AI-powered headline suggestion tool into its copywriters’ workflow. The tool takes a brief and existing copy as input and returns five headline variations for the copywriter to review and choose from. The initial implementation uses a large language model API with no streaming, delivering the complete set of five headlines after an average of 12 seconds. User testing reveals that copywriters find the wait long enough to break their flow, often switching to other tasks while waiting and losing context by the time the results arrive. The engineering team implements two changes: first, they switch to a streaming API call that begins displaying partial output in under 1 second; second, they restructure the prompt to generate headlines one at a time rather than as a batch, so the copywriter sees the first headline within 2 seconds and can begin evaluating it while the remaining four are generated. The perceived latency drops from 12 seconds to under 2 seconds for first results, even though the total generation time for all five headlines is unchanged. Post-implementation surveys show a significant improvement in workflow satisfaction, and usage rates increase as copywriters adopt the tool for routine headline ideation rather than treating it as a tool of last resort.

Understand AI deployment constraints, including latency, through The Creative Cadence Workshop.

The generative AI foundations module covers how production AI systems work, including the latency, throughput, and infrastructure tradeoffs that determine which AI capabilities are practical in client-facing and real-time applications.