
Knowledge Distillation.

A model compression technique that trains a smaller student model to replicate the behavior of a larger teacher model. The student learns from the teacher’s output distributions rather than from hard labels alone, producing a compact model that retains most of the teacher’s performance at a fraction of its inference cost. Knowledge distillation is how many production AI systems are made fast and cheap enough to deploy at scale.

Also known as: model distillation, teacher-student training, model compression via distillation.

What it is

A working definition of knowledge distillation.

In standard supervised learning, a model is trained to predict the correct hard label for each training example. Knowledge distillation instead trains the student model to match the probability distribution that the teacher model produces over all possible outputs. These soft labels reflect the teacher’s uncertainty and the similarity relationships it has learned between output classes, so they carry more information than hard labels. A teacher that assigns 0.7 probability to class A, 0.2 to class B, and 0.1 to class C is communicating not just that A is correct but also that B is more similar to A than C is. That information is lost when the label is compressed to a single correct-class indicator.
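The difference between hard and soft labels can be made concrete with a small sketch. It reuses the 0.7 / 0.2 / 0.1 teacher distribution from the paragraph above; the two student predictions (`student_x`, `student_y`) and the helper `cross_entropy` are hypothetical, introduced only for illustration.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) in nats; zero-weight target terms are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

hard_label = [1.0, 0.0, 0.0]     # "class A is correct", nothing more
teacher_soft = [0.7, 0.2, 0.1]   # teacher distribution from the text

# Two hypothetical students that both give class A probability 0.7
student_x = [0.7, 0.2, 0.1]      # ranks B above C, like the teacher
student_y = [0.7, 0.1, 0.2]      # ranks C above B, unlike the teacher

hard_x = cross_entropy(hard_label, student_x)
hard_y = cross_entropy(hard_label, student_y)
soft_x = cross_entropy(teacher_soft, student_x)
soft_y = cross_entropy(teacher_soft, student_y)

print("hard-label loss:", hard_x, hard_y)   # identical for both students
print("soft-label loss:", soft_x, soft_y)   # lower for the student that agrees with the teacher
```

Under the hard label the two students are indistinguishable, because only the probability on class A enters the loss. Under the soft label, the student that reproduces the teacher’s B-over-C ordering gets the lower loss, which is the extra signal distillation relies on.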

The temperature parameter in distillation controls how soft or hard the teacher’s probability distribution is when used as a training target. At temperature 1, the original softmax probabilities are used. At higher temperatures, the distribution is flattened, making low-probability classes contribute more to the training signal and transferring more of the teacher’s learned similarity structure to the student. Hinton, Vinyals, and Dean showed in 2015 that this soft target training with appropriately chosen temperature substantially outperforms training the student on hard labels alone, producing students that achieve much higher performance relative to their size than would be achieved through direct training.
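A minimal sketch of temperature scaling and a combined distillation loss follows, using only the Python standard library. The logits, the weighting parameter `alpha`, and the function names are assumptions for illustration. The factor of T squared on the soft term follows the scaling recommended by Hinton, Vinyals, and Dean; blending the soft and hard terms with a single `alpha` weight is one common convention, not the only one.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature (T=1 is the ordinary softmax)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term (alpha is an assumed weight)."""
    soft_teacher = softmax_with_temperature(teacher_logits, temperature)
    soft_student = softmax_with_temperature(student_logits, temperature)
    # T^2 keeps the soft term's gradient scale comparable across temperatures
    soft_term = (temperature ** 2) * kl_divergence(soft_teacher, soft_student)
    # Ordinary cross-entropy against the hard label, computed at temperature 1
    hard_term = -math.log(softmax_with_temperature(student_logits, 1.0)[true_index])
    return alpha * soft_term + (1 - alpha) * hard_term

teacher_logits = [2.0, 0.75, 0.05]   # hypothetical teacher outputs for three classes
student_logits = [1.2, 0.1, 0.4]     # hypothetical student outputs

p_t1 = softmax_with_temperature(teacher_logits, 1.0)
p_t4 = softmax_with_temperature(teacher_logits, 4.0)
print("T=1:", p_t1)
print("T=4:", p_t4)                  # flatter: low-probability classes get more weight
print("loss:", distillation_loss(student_logits, teacher_logits, true_index=0))
```

Raising the temperature shrinks the top probability and raises the smallest ones, which is the flattening described above. In a real training loop the same loss would be computed with a tensor library and backpropagated through the student only.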

Knowledge distillation produces some of the most widely used models in production AI systems. DistilBERT, a distilled version of BERT, retains 97% of BERT’s language understanding performance while being 40% smaller and 60% faster at inference. Many of the efficient language models used in production AI applications are distilled versions of larger foundation models, which lets much of the quality of a large model be served at the cost of a much smaller one. For agencies deploying AI features in production, understanding distillation explains why smaller models can sometimes match larger models on specific tasks and how to evaluate whether a distilled model is appropriate for a given application.

Why ad agencies care

Why knowledge distillation might matter more in agency work than in most industries.

The most capable AI models are too large and slow for many production applications. Knowledge distillation is the technique that makes these models deployable at production scale by producing compact, fast models that preserve most of the quality of their larger teachers. A working ad agency that understands distillation can evaluate distilled model options, make informed build-or-buy decisions for custom model compression, and understand why some smaller models achieve performance that their size would not predict.

Distilled models enable real-time AI features that large models cannot serve at acceptable latency. A production recommendation system that needs to serve 10,000 requests per second cannot practically use a 70-billion-parameter model to score individual requests. A distilled 1-billion-parameter model that achieves, for example, 92% of the large model’s quality could serve the same request volume at 15x lower inference cost and within the latency budget. Knowing that distillation exists and what quality-cost tradeoff it offers lets agencies specify realistic requirements for AI inference infrastructure.

Custom distillation is an option for adapting large models to specific client tasks economically. An agency that has fine-tuned a large language model for a specific client application can distill the fine-tuned model into a smaller student optimized for that specific task. The student inherits the task-specific quality improvements from the fine-tuned teacher while being deployed at a fraction of the inference cost. This distilled, task-specific student can be deployed for high-volume client applications where the large teacher would be too expensive to run in production, making fine-tuned model quality economically accessible at production scale.

Evaluating distilled models requires task-specific rather than general-domain benchmarks. A distilled model is optimized to match its teacher on the tasks represented in the distillation training data, not on all possible tasks. General-domain benchmarks like MMLU or ARC may show a larger quality gap between a teacher and its distilled student than exists on the specific task for which the student is being considered. Agencies evaluating distilled models should prioritize evaluation on their specific use case over general benchmark rankings, because a distilled model well matched to the deployment task may perform substantially better than its general benchmark rank would suggest.

In practice

What knowledge distillation looks like inside a working ad agency.

An agency has fine-tuned a 7-billion-parameter language model on a retail client’s product catalog and customer service conversation data to produce a customer service response generation model. The fine-tuned model produces responses that score 88% acceptable in human review, significantly better than the general-purpose model baseline of 71%. However, serving the 7B model at the client’s peak volume of 400 concurrent requests costs $2.40 per thousand interactions, which exceeds the client’s budget for the feature. The agency distills the fine-tuned 7B teacher into a 1-billion-parameter student model, using the teacher’s soft label outputs on 80,000 client-specific interaction examples as the distillation training signal. The distilled 1B student achieves 84% acceptable responses in human review, a 4-point quality loss compared to the teacher but a substantial improvement on the 71% general-purpose baseline. Inference cost for the 1B student is $0.38 per thousand interactions, within the client’s budget. The agency deploys the distilled model, documenting the 4-point quality gap from the teacher as an acceptable tradeoff for a 6.3x cost reduction that makes the feature economically viable.
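The arithmetic behind this tradeoff can be checked with a few lines. The figures are taken directly from the scenario above; the variable names and the derived ratios are illustrative only.

```python
# Figures from the scenario above
teacher_cost_per_1k = 2.40    # USD per thousand interactions, 7B teacher
student_cost_per_1k = 0.38    # USD per thousand interactions, 1B student
teacher_acceptable = 88       # % acceptable in human review
student_acceptable = 84
baseline_acceptable = 71      # general-purpose model baseline

# Cost side of the tradeoff
cost_reduction = teacher_cost_per_1k / student_cost_per_1k
savings_per_1k = teacher_cost_per_1k - student_cost_per_1k

# Quality side of the tradeoff
quality_gap = teacher_acceptable - student_acceptable
quality_retained = student_acceptable / teacher_acceptable
# Share of the teacher's improvement over the baseline that the student keeps
gain_retained = (student_acceptable - baseline_acceptable) / (
    teacher_acceptable - baseline_acceptable
)

print(f"cost reduction: {cost_reduction:.1f}x")
print(f"savings per thousand interactions: ${savings_per_1k:.2f}")
print(f"quality gap: {quality_gap} points")
print(f"acceptance rate retained: {quality_retained:.1%}")
print(f"fine-tuning gain retained: {gain_retained:.1%}")
```

Read this way, the student keeps about 95% of the teacher’s acceptance rate (84 of 88) and roughly three quarters of the teacher’s 17-point gain over the baseline (13 of 17), while costing about one sixth as much to serve.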
