AI Glossary · Letter I

Inference.

The process of using a trained machine learning model to generate predictions or outputs on new input data, as opposed to training, which is the process of learning model parameters from data. Inference is what happens when an AI tool responds to a prompt, classifies a piece of content, or scores a lead, and its speed, cost, and reliability determine whether AI-powered features are viable in production applications.

Also known as model inference, AI inference, prediction serving

What it is

A working definition of inference.

Machine learning has two distinct phases: training, where a model’s parameters are learned from data, and inference, where the trained model is applied to new inputs to produce predictions or outputs. During inference, the model’s parameters are fixed and computation flows in one direction only: from input, through the model’s layers, to output. There is no gradient computation, no weight update, and no use of training data. This makes a single inference request far cheaper than a training run for most model architectures. At the scale at which production AI systems serve predictions, however, inference costs accumulate and often become the primary ongoing operational expense of an AI deployment.
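As a minimal sketch of what "fixed parameters, one direction" means, the example below runs a forward pass through a tiny logistic-regression model. The weights, bias, and input features are hypothetical values chosen for illustration; a production model would have far more parameters, but the shape of the computation is the same.

```python
import math

# Hypothetical parameters of an already-trained model.
# During inference these values are read but never modified.
WEIGHTS = (0.8, -1.2, 0.5)
BIAS = -0.1


def predict(features):
    """Run one forward pass: weighted sum, then a sigmoid to get a score in (0, 1)."""
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    z = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


# A new, previously unseen input (for example, three signals describing a lead).
score = predict([1.0, 0.0, 2.0])
print(round(score, 3))
```

Nothing in `predict` computes a gradient or touches training data, which is why the same function can be called repeatedly on new inputs at a predictable cost per call.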

Inference latency, the time from receiving an input to returning a prediction, is the primary performance constraint for real-time AI applications. User-facing features like content recommendation widgets and interactive personalization systems often target inference latency under roughly 100 milliseconds to avoid perceptible delays; for chatbots, the comparable target is how quickly the first words of a response appear. Language model inference is particularly latency-sensitive because the output is generated token by token, with each token requiring its own forward pass through the model, so total latency scales with both model size and output length. Optimization techniques reduce inference cost and latency while preserving most of the quality of larger models. Model quantization reduces weight precision from 32-bit or 16-bit floating point to 8-bit or 4-bit representations (typically integers), usually with only a small quality loss. Model distillation trains a smaller model to replicate the outputs of a larger one.
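The relationship between output length, model size, and latency can be sketched with a simple linear estimate. The function name and the per-token speeds below are hypothetical; real figures depend on the model, the hardware, and the serving stack, so treat this as a back-of-the-envelope model rather than a benchmark.

```python
def estimate_generation_latency(output_tokens, seconds_per_token, prompt_seconds=0.0):
    """Estimate end-to-end latency for token-by-token generation.

    Each output token needs its own forward pass, so latency grows linearly
    with output length. Larger models have a larger seconds_per_token.
    """
    if output_tokens < 0 or seconds_per_token < 0 or prompt_seconds < 0:
        raise ValueError("inputs must be non-negative")
    return prompt_seconds + output_tokens * seconds_per_token


# Hypothetical speeds: a small model at 5 ms per token, a large one at 40 ms per token.
small = estimate_generation_latency(200, 0.005, prompt_seconds=0.1)
large = estimate_generation_latency(200, 0.040, prompt_seconds=0.1)
print(f"small model: {small:.1f} s, large model: {large:.1f} s")
```

Under these assumed speeds, the same 200-token output takes several times longer on the larger model, and halving the output length roughly halves the generation time for either one.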

Batch inference and real-time inference are two deployment patterns that optimize for different objectives. Real-time inference processes requests individually as they arrive, prioritizing low latency at the expense of throughput. Batch inference processes groups of requests together on a schedule; higher latency is acceptable, and computational efficiency is better because GPU utilization is higher when many requests are processed simultaneously. Agencies building AI-powered features need to match the inference deployment pattern to the latency requirements of the application: real-time inference for user-facing features where an immediate response is required, and batch inference for background scoring, content pre-tagging, and audience pre-computation, where results can be computed in advance.
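The tradeoff between the two patterns can be illustrated with a toy cost model in which every batch pays a fixed overhead once plus a small cost per item. The overhead and per-item figures are hypothetical, and the model ignores the time requests spend waiting for a batch to fill, so it understates batch latency.

```python
def batch_stats(batch_size, fixed_overhead_s, per_item_s):
    """Return (latency_seconds, throughput_items_per_second) for one batch.

    Toy model: the fixed overhead is paid once per batch, so larger batches
    spread it over more items and achieve higher throughput.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    latency = fixed_overhead_s + batch_size * per_item_s
    return latency, batch_size / latency


# Hypothetical costs: 50 ms of overhead per batch, 2 ms of compute per item.
for size in (1, 8, 64):
    latency, throughput = batch_stats(size, fixed_overhead_s=0.05, per_item_s=0.002)
    print(f"batch={size:>2}  latency={latency:.3f} s  throughput={throughput:.1f} items/s")
```

With these assumed numbers, a batch of one returns fastest but serves the fewest items per second, while a batch of 64 takes longer to return and serves many more, which is the same tension the two deployment patterns resolve in opposite directions.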

Why ad agencies care

Why inference might matter more in agency work than in most industries.

Every AI tool an agency uses produces its outputs through inference. The economics of AI tooling, the latency of AI-powered features in client products, and the feasibility of high-volume AI applications all depend on inference cost and performance. A working ad agency that understands inference, at the level of knowing why some AI calls are faster and cheaper than others, can make better infrastructure decisions, set realistic latency expectations, and troubleshoot performance issues in AI-powered workflows.

API pricing for AI tools is inference pricing. When an agency pays per token for a language model API or per image for a generation service, it is paying for inference compute on the provider’s hardware. The cost scales with model size, context length, and output length because these factors directly determine the amount of inference compute required. Understanding this makes API cost modeling more accurate: a workflow that generates long outputs from large models is significantly more expensive per task than one that generates short outputs from smaller models. Evaluating which model size is sufficient for the task’s quality requirement, rather than defaulting to the largest available, is a practical cost optimization.
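A per-task cost estimate under token-based pricing can be sketched as follows. The prices, token counts, and monthly volume are hypothetical placeholders and do not reflect any provider's actual rates; the point is the structure of the calculation, with input and output tokens priced separately.

```python
def cost_per_task(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Cost in dollars of one API call, with separate input and output token prices."""
    total = input_tokens * price_in_per_million + output_tokens * price_out_per_million
    return total / 1_000_000


# Hypothetical task: 1,500 prompt tokens in, 500 generated tokens out.
# Hypothetical prices in dollars per million tokens (input, output).
large_model = cost_per_task(1500, 500, price_in_per_million=3.00, price_out_per_million=15.00)
small_model = cost_per_task(1500, 500, price_in_per_million=0.25, price_out_per_million=1.25)

tasks_per_month = 100_000  # hypothetical workload
print(f"large model: ${large_model * tasks_per_month:,.2f} per month")
print(f"small model: ${small_model * tasks_per_month:,.2f} per month")
```

Running the same calculation with a workflow's real token counts and a provider's published prices gives a first estimate of whether a smaller model is worth validating for the task.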

On-premises and cloud inference decisions require comparing cost and latency tradeoffs. For agencies that run significant volumes of AI inference for client applications, the choice between API-based inference and self-hosted model inference is a tradeoff between cost and latency. API inference has little or no fixed cost and scales with usage, making it economical at low volumes. Self-hosted inference has a fixed infrastructure cost but a lower per-request cost at high volumes, and it enables latency optimization through hardware selection and batching strategies. Understanding the inference compute requirements of the application, in tokens per day or requests per second, is the starting point for this analysis.
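The volume at which self-hosting becomes cheaper than an API can be estimated with a simple break-even calculation. All figures below are hypothetical, and the sketch leaves out engineering time, idle capacity, and redundancy, which a real comparison would need to include.

```python
def break_even_thousands(fixed_monthly_cost, api_cost_per_1k, self_hosted_cost_per_1k):
    """Monthly volume, in thousands of requests, above which self-hosting is cheaper.

    Returns None when self-hosting has no per-request advantage to recover
    its fixed cost.
    """
    saving_per_1k = api_cost_per_1k - self_hosted_cost_per_1k
    if saving_per_1k <= 0:
        return None
    return fixed_monthly_cost / saving_per_1k


# Hypothetical figures: $1,500 per month for a GPU instance,
# $4.00 per 1,000 requests via API, $1.00 per 1,000 requests self-hosted.
volume = break_even_thousands(1500.0, api_cost_per_1k=4.0, self_hosted_cost_per_1k=1.0)
print(f"break-even at about {volume:,.0f} thousand requests per month")
```

Below the break-even volume the API's pay-per-use pricing wins; above it, the fixed cost of self-hosting is spread over enough requests to come out ahead.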

Inference optimization techniques determine what is deployable within latency constraints. Model quantization, knowledge distillation, and speculative decoding reduce the cost or latency of generating output without proportional quality degradation. Quantization and distillation make each forward pass cheaper, while speculative decoding uses a small draft model to propose several tokens that the large model then verifies in a single pass. These techniques are what make large language models deployable in production applications with real-time latency requirements: a 70-billion-parameter model with 4-bit quantization can serve responses in latency ranges that a full-precision model of the same size, on the same hardware, cannot. Knowing that these techniques exist and that they are standard practice in production AI deployment helps agencies evaluate vendor latency claims and understand the engineering tradeoffs behind AI service performance tiers.
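One way to see why quantization matters for a 70-billion-parameter model is to compute the memory needed to hold its weights at different precisions. The sketch below counts the weights only; it ignores activations, cached context, and the small amount of extra metadata that quantized formats store, so real footprints are somewhat larger.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    if num_parameters < 0 or bits_per_weight <= 0:
        raise ValueError("num_parameters must be >= 0 and bits_per_weight > 0")
    return num_parameters * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(PARAMS_70B, bits):.0f} GB")
```

At 32 bits the weights alone come to about 280 GB, while at 4 bits they come to about 35 GB. Fewer bytes to store and to read for every generated token is a large part of why a quantized model can fit on less hardware and often respond faster.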

In practice

What inference looks like inside a working ad agency.

An agency is building an AI-powered product description generator for a retail client. The generator will be integrated into the client’s product catalog management tool, where catalog managers expect to see a description within 3 seconds of clicking “Generate.” The agency evaluates three options: a 70B-parameter language model via API that produces high-quality output but averages 8 seconds of latency at the required output length; a 13B-parameter model via the same API that averages 2.5 seconds of latency with slightly lower quality; and a 7B-parameter model with 4-bit quantization deployed on a single GPU instance that averages 1.1 seconds of latency with acceptable quality for this use case. The agency selects the self-hosted 7B quantized model: it meets the latency requirement that the 70B model fails; it costs 40% less per request than the API-hosted 13B model at the client’s projected usage volume; and its quality, while lower than that of the 70B model, is validated by catalog manager review to meet the client’s standards. The deployment decision is documented with explicit latency benchmarks and cost projections that justify the model selection on technical and economic grounds.
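The selection logic in this scenario can be sketched as a filter followed by a cost comparison. The latencies and the 3-second budget come from the scenario above, and the 7B model's relative cost of 0.6 reflects the stated 40% saving against the 13B model. The 70B model's relative cost and the assumption that the 13B model's quality would also pass review are hypothetical, added only so the example runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    avg_latency_s: float     # average latency at the required output length
    relative_cost: float     # cost per request relative to the 13B API model
    quality_approved: bool   # whether reviewers accept the output quality


CANDIDATES = [
    Candidate("70B via API", 8.0, 3.0, True),            # relative cost is hypothetical
    Candidate("13B via API", 2.5, 1.0, True),            # approval is assumed
    Candidate("7B 4-bit self-hosted", 1.1, 0.6, True),   # 40% cheaper than 13B
]


def select_model(candidates, latency_budget_s):
    """Pick the cheapest candidate that meets the latency budget and quality bar."""
    eligible = [
        c for c in candidates
        if c.avg_latency_s <= latency_budget_s and c.quality_approved
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.relative_cost)


choice = select_model(CANDIDATES, latency_budget_s=3.0)
print(choice.name if choice else "no candidate meets the requirements")
```

Writing the decision down this way keeps the latency budget, the cost comparison, and the quality check explicit, which is the same information the documented benchmarks and projections are meant to capture.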
