What is Quantization?

What it is

A working definition of quantization.

Neural network parameters are typically represented as 32-bit floating-point numbers during training, providing high numerical precision for gradient computations. After training, that precision is often unnecessary for inference: most of the model’s useful information is encoded in the relative magnitudes of parameters rather than their exact values. Quantization approximates each parameter using a lower-precision representation, typically 8-bit integers (INT8) that can take 256 distinct values, compared to the approximately 4 billion distinct values representable in 32-bit float. The approximation introduces a small amount of error, but the error is typically small enough to preserve model accuracy within acceptable bounds while reducing memory usage by 4x and enabling faster inference on hardware optimized for integer arithmetic.
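As a minimal sketch of the idea, the snippet below rounds a handful of made-up weights onto a signed 8-bit grid and maps them back. The function names and the sample weights are illustrative assumptions, and the symmetric single-scale scheme is only one of several in use.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers using a single scale factor."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for INT8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clip to the representable range.
    # (Python's round() sends exact .5 ties to the even neighbour.)
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.003, 0.88, -0.51]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)                      # integer codes stored in place of the floats
print(max_error, scale / 2)   # rounding error stays within half a step
```

Each stored value now occupies one byte instead of four, which is where the 4x memory reduction comes from. The rounding error for any in-range weight is bounded by half of `scale`.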

Post-training quantization applies quantization to an already-trained model without retraining, by finding the quantization parameters (scale and zero-point values) that best map the 32-bit float weight distribution to the target integer range. This is fast and requires no training data beyond a small calibration dataset used to determine the quantization parameters. Quantization-aware training incorporates the quantization process into the training loop, training the model with simulated quantization applied during forward passes so the model learns to be robust to quantization error. Quantization-aware training typically produces higher-accuracy quantized models than post-training quantization, at the cost of a retraining step.
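The scale and zero-point mentioned above can be sketched as follows. This assumes the simplest calibration rule (the min and max of the calibration samples); production toolkits often use percentile or histogram-based rules instead, and every name here is hypothetical. The `fake_quantize` helper shows the quantize-then-dequantize round trip that quantization-aware training simulates in the forward pass.

```python
def calibrate_affine(samples, qmin=-128, qmax=127):
    """Derive scale and zero-point from observed calibration values."""
    lo = min(min(samples), 0.0)   # widen the range to include 0.0 so that
    hi = max(max(samples), 0.0)   # zero maps to an exact integer code
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Float -> integer code, clipped to the target range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Integer code -> approximate float."""
    return (q - zero_point) * scale


def fake_quantize(x, scale, zero_point):
    """Round trip used to simulate quantization error in float arithmetic."""
    return dequantize(quantize(x, scale, zero_point), scale, zero_point)


# Hypothetical calibration values, for illustration only.
calibration = [-0.5, 0.1, 0.7, 1.5, 2.05]
scale, zp = calibrate_affine(calibration)

print(scale, zp)
print(quantize(-0.5, scale, zp), quantize(2.05, scale, zp))  # range endpoints
print(fake_quantize(0.7, scale, zp))   # close to 0.7, within half a step
print(quantize(3.0, scale, zp))        # values outside the range are clipped
```

Values outside the calibrated range are clipped, which is why the calibration dataset should resemble production inputs.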

4-bit quantization (INT4) and binarization (1-bit weights) push compression further at greater accuracy cost. 4-bit quantization of large language models has been a particularly active area, with methods such as GPTQ and AWQ achieving 4-bit compression with near-16-bit accuracy on many language tasks. These extreme quantization approaches are primarily used for inference on consumer hardware, such as running a 7B parameter language model on a laptop with 16GB of RAM. At 16-bit precision the weights alone take approximately 14GB, leaving too little memory for the operating system and the model’s working state to make this practical; 4-bit quantization reduces the weight footprint to approximately 3.5GB and makes it feasible.
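The footprint figures above follow from simple arithmetic, sketched below. The helper name is an assumption, and the estimate covers weights only: it ignores activations, any inference-time caches, and the small overhead of storing quantization scales.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_7b = 7_000_000_000

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(params_7b, bits):.1f} GB")
```

For a 7B parameter model this gives 14.0 GB at 16-bit and 3.5 GB at 4-bit, matching the figures in the text.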

Why ad agencies care

Why quantization determines the cost-performance tradeoff of AI model deployment in agency production systems.

A working ad agency deploying AI models for real-time creative scoring, copy generation, or bid optimization at production scale faces a direct tradeoff between model quality (which tends to increase with model size and parameter precision) and inference cost (which increases with model size and parameter precision). Quantization is the compression tool that shifts this tradeoff favorably: a quantized model delivers most of the accuracy of the full-precision model at a fraction of the memory and compute cost. Understanding when quantization is appropriate and what accuracy tradeoffs to expect enables agencies to make informed model deployment decisions.

INT8 quantization of production inference models typically reduces inference cost by 2x to 4x with less than 1% accuracy degradation on most tasks. For agencies deploying language models or vision models at high request volume, INT8 quantization is the standard first compression step, with the best ratio of cost reduction to accuracy loss. For example, a creative quality scoring model serving 2 million inferences per month might cut its monthly inference cost from $800 to $200 with INT8 quantization, a 4x reduction that is worth taking as long as accuracy on the held-out evaluation set drops by less than 1 percentage point. For high-volume production use cases, the economics of quantization are compelling and the accuracy tradeoff is typically acceptable.

Quantized models require quality validation on the specific content types they will process in production. A quantized model that passes a general accuracy benchmark may degrade disproportionately on specific content types that are poorly represented in the benchmark. A brand voice classifier quantized from 16-bit to 8-bit precision may maintain benchmark accuracy while degrading specifically on short-copy inputs (headlines and subject lines) where the model’s numerical precision margins are smallest. Validating quantized models on a production-representative sample that includes edge cases and low-volume content types is essential for catching quantization-induced degradation before it affects client-facing outputs.
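One way to run this kind of check is to compare baseline and quantized accuracy per content type rather than in aggregate. The sketch below assumes predictions from both models have already been collected; the record layout, segment names, labels, and the 1-point threshold are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """records: iterable of (segment, label, baseline_pred, quantized_pred)."""
    totals = defaultdict(lambda: [0, 0, 0])  # [count, baseline hits, quantized hits]
    for segment, label, base_pred, quant_pred in records:
        t = totals[segment]
        t[0] += 1
        t[1] += int(base_pred == label)
        t[2] += int(quant_pred == label)
    return {seg: (b / n, q / n) for seg, (n, b, q) in totals.items()}


def degraded_segments(records, max_drop=0.01):
    """Segments where quantized accuracy falls more than max_drop below baseline."""
    report = accuracy_by_segment(records)
    return sorted(seg for seg, (base, quant) in report.items()
                  if base - quant > max_drop)


# Tiny made-up sample; a real check would use a production-representative set.
records = [
    ("body_copy", "on_brand", "on_brand", "on_brand"),
    ("body_copy", "off_brand", "off_brand", "off_brand"),
    ("body_copy", "on_brand", "on_brand", "on_brand"),
    ("headline", "on_brand", "on_brand", "off_brand"),
    ("headline", "off_brand", "off_brand", "off_brand"),
    ("headline", "on_brand", "on_brand", "on_brand"),
]

flagged = degraded_segments(records)
print(flagged)
```

An aggregate accuracy figure over these six records would hide the fact that every quantization-induced error falls in the short-copy segment.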

Quantization combined with pruning and distillation provides multiplicative compression for extreme inference cost reduction. Each compression technique operates on a different aspect of model complexity: pruning removes parameters, quantization reduces parameter precision, and distillation transfers knowledge to a smaller architecture. Applied in combination, these three techniques can produce models that are 20x to 50x cheaper to serve than the original full-precision model with relatively modest accuracy degradation. For agencies building AI products that must serve high request volumes with tight cost constraints, understanding the full compression toolbox and how to apply it systematically is the competency that makes ambitious AI product economics feasible.
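The multiplicative effect can be illustrated with a short calculation. The per-technique factors below are hypothetical placeholders rather than measured results, and the calculation assumes the savings are independent and fully realized on the serving hardware, which is often only approximately true.

```python
from math import prod


def combined_cost_reduction(factors):
    """Multiply per-technique serving-cost reduction factors together."""
    return prod(factors.values())


# Hypothetical reduction factors, for illustration only.
factors = {
    "pruning": 2.0,        # assumed: half of the compute removed
    "quantization": 4.0,   # assumed: 32-bit -> 8-bit
    "distillation": 4.0,   # assumed: student model 4x cheaper than teacher
}

combined = combined_cost_reduction(factors)
print(f"{combined:.0f}x cheaper to serve")
```

Under these assumed factors the combined reduction is 32x, inside the 20x to 50x range quoted above.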

In practice

What quantization looks like inside a working ad agency.

An agency has built a real-time copy quality scoring API that evaluates ad copy variants submitted by its copywriting teams against a 7-dimension quality rubric and returns scores within 500 milliseconds. The API uses a fine-tuned BERT-base model (110 million parameters, 32-bit float precision) that achieves 91% agreement with human expert ratings on the validation set. It currently handles 400,000 scoring requests per month at a hosting cost of $1,240 per month. The agency projects that copy volume will grow to 1.8 million requests per month within 12 months, which would cost $5,580 per month at current efficiency.

The agency evaluates INT8 post-training quantization of the BERT-base model using the transformers library’s quantization utilities and a calibration dataset of 2,000 copy samples. The quantized model achieves 89.3% agreement with human expert ratings on the validation set, a 1.7 percentage point reduction. Inference latency improves from an average of 180ms to 94ms per request, and memory footprint drops from 440MB to 115MB, enabling the API server to process 3.2x more concurrent requests on the same hardware. At the projected 1.8 million monthly requests, the quantized model costs $1,740 per month versus $5,580 for the full-precision model, saving $3,840 per month.

The agency accepts the 1.7 point accuracy tradeoff given the economic benefit, after validating that the reduction is distributed evenly across all 7 quality dimensions rather than concentrated in any single one. The quantized model is deployed with a fallback to the full-precision model for requests flagged as high-stakes (compliance reviews and client-facing showcase copy), so that the accuracy tradeoff is applied only where it is economically justified.
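The cost figures in this scenario can be reproduced with a short calculation. The sketch assumes hosting cost scales linearly with request volume and inversely with throughput per server; the function names and request-type labels are hypothetical. It also includes a minimal version of the high-stakes fallback rule.

```python
def monthly_cost(requests, cost_per_request, throughput_gain=1.0):
    """Projected hosting cost, assuming cost falls in proportion to throughput."""
    return requests * cost_per_request / throughput_gain


current_requests = 400_000
current_cost = 1_240.0
projected_requests = 1_800_000
throughput_gain = 3.2          # concurrent-request gain reported for INT8

cost_per_request = current_cost / current_requests
full_precision = monthly_cost(projected_requests, cost_per_request)
quantized = monthly_cost(projected_requests, cost_per_request, throughput_gain)

print(f"full precision: ${full_precision:,.2f}")
print(f"quantized:      ${quantized:,.2f}")   # the scenario rounds this to $1,740
print(f"saving:         ${full_precision - quantized:,.2f}")

# Hypothetical request-type labels for the fallback rule.
HIGH_STAKES = {"compliance_review", "client_showcase"}


def choose_model(request_type):
    """Route high-stakes requests to the full-precision model."""
    return "full_precision" if request_type in HIGH_STAKES else "int8"
```

The unrounded quantized cost comes out near $1,744, which the scenario rounds to $1,740, giving the quoted saving of about $3,840 per month.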

Build the model compression expertise that makes high-quality AI economically deployable at production scale through The Creative Cadence Workshop.
