AI Glossary · Letter S

Scaling.

In machine learning, scaling refers both to the preprocessing step that adjusts input feature magnitudes to a common range or distribution before training, and to the empirical observation that model performance improves predictably with increases in model size, training data volume, and compute. The term is also used for the infrastructure work of serving a model at production request volumes. Each meaning has practical implications for agency AI work: feature scaling affects training stability, scaling laws indicate how much larger or more data-rich an AI system needs to be to reach a target performance level, and infrastructure scaling determines whether a model can be deployed economically.

Also known as feature scaling, normalization, scaling laws

What it is

A working definition of scaling.

Feature scaling as preprocessing transforms input variables so that they occupy similar numerical ranges before being fed into a model. Without scaling, features with large numerical ranges, such as annual revenue in dollars, dominate features with small ranges, such as a binary flag, in the gradient computation during training. Two common approaches are min-max normalization, which maps each feature to the range [0, 1] by subtracting the minimum and dividing by the range, and standardization, which maps each feature to zero mean and unit variance by subtracting the mean and dividing by the standard deviation. Neural networks, other models trained by gradient descent, and distance- or margin-based algorithms such as k-nearest neighbors and support vector machines are sensitive to feature scale. Tree-based models such as random forests and gradient boosted trees are largely insensitive to it, because they split on the relative ordering of feature values rather than their magnitude.
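The two transformations described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a production pipeline; the function names and the revenue figures are hypothetical.

```python
from statistics import fmean, pstdev

def min_max_scale(values):
    """Map values to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant feature: there is no spread to rescale, so return zeros.
        return [0.0 for _ in values]
    return [(x - lo) / span for x in values]

def standardize(values):
    """Map values to zero mean and unit variance: (x - mean) / std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]

# Hypothetical annual revenue figures in dollars (illustrative only).
revenue = [120_000.0, 450_000.0, 980_000.0, 2_300_000.0]
revenue_minmax = min_max_scale(revenue)
revenue_std = standardize(revenue)
```

In a real pipeline the minimum, maximum, mean, and standard deviation would normally be computed on the training set only and then reused unchanged for validation and production data.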

Scaling laws in deep learning describe the empirical relationship between model performance and the amount of compute, data, and model parameters used in training. Research on large language models has shown that test loss decreases approximately as a power law in each of compute, training tokens, and parameter count, provided the other two resources are not the bottleneck. Later work on compute-optimal training found that, for a fixed compute budget, parameter count and training tokens should be increased in roughly equal proportion. These scaling laws allow AI researchers and practitioners to estimate how much performance improvement a given increase in model size or training data will produce, and to allocate compute budgets efficiently between larger models, more training data, and more training steps.
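A power law appears as a straight line on log-log axes, so it can be fitted with ordinary least squares on the logarithms. The sketch below assumes the simple form loss = a * compute^(-b); the coefficients 5.0 and 0.05 are made up to generate synthetic points and are not measurements from any real model.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)

# Synthetic points generated from a made-up law: loss = 5.0 * compute**(-0.05)
compute = [1e18, 1e19, 1e20, 1e21]
loss = [5.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolate the fitted law to ten times the largest observed budget.
predicted_loss = a * 1e22 ** (-b)
```

Because the exponent is small, each tenfold increase in compute buys only a modest reduction in loss under this assumed law, which is the intuition behind budgeting decisions based on scaling laws. Extrapolating far beyond the fitted range is an assumption in itself and should be treated with caution.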

Infrastructure scaling refers to the engineering challenge of deploying AI systems that must serve large volumes of real-time requests reliably. A model that produces good predictions in batch evaluation may not be deployable in production without engineering investment in request routing, caching, hardware acceleration, model quantization to reduce memory footprint, and load balancing across inference servers. Infrastructure scaling is a separate concern from model scaling: a smaller model that runs at low latency at scale may be preferable to a larger model that requires 10x the compute per inference, even if the larger model has higher accuracy on the evaluation benchmark.
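Model quantization, listed above as a way to reduce memory footprint, can be illustrated with a toy example. This is a minimal sketch that assumes symmetric per-tensor quantization of floating-point weights to 8-bit integers; real toolkits use more elaborate schemes, and the weight values here are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude onto 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.89, -0.55]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Storing each weight in 8 bits instead of 32 cuts weight memory by a factor of four, at the cost of the small rounding error measured by `max_error`.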

Why ad agencies care

Why understanding both forms of scaling prevents common training failures and unrealistic expectations about AI capability.

A working ad agency building models for client use cases needs to understand feature scaling as a practical necessity: missing it causes training failures, slow convergence, and biased model outputs that are difficult to debug without knowing to look for scale-related issues. Understanding scaling laws is important for setting realistic expectations with clients about AI capability: the performance improvements clients read about in LLM announcements reflect massive increases in compute and data that are not available in every deployment context, and smaller purpose-built models for specific tasks may not improve further without similar investment in data and compute.

Unscaled features in neural network and logistic regression models produce a poorly conditioned loss surface, which leads to slow or unstable training that is often mistaken for an architectural problem. A propensity model trained on features that include both a page visit count ranging from 0 to 850 and a normalized engagement score ranging from 0 to 1 will have gradients dominated by the high-range feature, so the model effectively ignores the normalized feature during early training. The model appears to train normally on the high-range feature but is slow to incorporate the information in the low-range feature, resulting in lower accuracy than would be achieved with both features properly scaled. Standardizing all continuous features before training a gradient descent-based model is a 5-minute preprocessing step that prevents most problems of this class.
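The gradient imbalance described above can be checked numerically. The sketch below assumes a linear model with mean squared error, evaluated at all-zero weights, and synthetic data in which the target depends equally on both features once their ranges are accounted for. The data, the target formula, and the function names are hypothetical.

```python
import random
from statistics import fmean, pstdev

def mse_gradients(rows, targets):
    """Gradient of mean squared error w.r.t. each weight of a linear model,
    evaluated at zero weights and zero bias (so every prediction is 0)."""
    n = len(rows)
    grads = []
    for j in range(len(rows[0])):
        # d/dw_j mean((pred - y)^2) = (2/n) * sum((pred - y) * x_j), pred = 0
        grads.append(2.0 / n * sum((0.0 - y) * row[j]
                                   for row, y in zip(rows, targets)))
    return grads

def standardize_columns(rows):
    """Standardize each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    stats = [(fmean(c), pstdev(c)) for c in cols]
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in rows]

rng = random.Random(0)
# Column 0: page visits in [0, 850]; column 1: engagement score in [0, 1].
rows = [[rng.uniform(0, 850), rng.uniform(0, 1)] for _ in range(1000)]
# Synthetic target that weights both features equally after rescaling.
targets = [visits / 850 + engagement for visits, engagement in rows]

raw = mse_gradients(rows, targets)
scaled = mse_gradients(standardize_columns(rows), targets)

raw_ratio = abs(raw[0] / raw[1])           # visits gradient vs engagement gradient
scaled_ratio = abs(scaled[0] / scaled[1])  # same ratio after standardization
```

Under these assumptions `raw_ratio` should be on the order of the 850-to-1 range ratio, while `scaled_ratio` should be close to 1, meaning both features receive comparable updates once standardized.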

Scaling laws help explain why fine-tuned large models frequently outperform purpose-built small models on complex language tasks, even when both are given the same task-specific training data. A fine-tuned version of a 70-billion parameter foundation model trained on a client’s product description dataset may outperform a 500-million parameter model trained from scratch on the same data, because the large model’s pre-trained representations encode a richer and more generalizable understanding of language that the fine-tuning step redirects toward the specific task. Scaling law intuitions help agencies explain to clients why using a vendor’s large foundation model API with fine-tuning is often more cost-effective than training a purpose-built model from scratch for language tasks, even when the purpose-built model would require less inference compute once deployed.

Infrastructure scaling decisions determine whether production model deployment is economically viable at the request volumes client campaigns require. A real-time personalization model that requires 500 milliseconds per inference is a poor fit for a homepage product recommendation system serving 40,000 requests per minute, regardless of its accuracy: at that rate roughly 333 requests are in flight at any moment, and each recommendation takes half a second to arrive. Agencies selecting or building AI systems for real-time use cases must include latency and throughput requirements in the model selection criteria, not only accuracy. Model quantization (reducing weight precision from 32-bit floating point to 8-bit integers), caching of embeddings for known users, and model distillation (training a smaller model to mimic a larger model’s outputs) are engineering approaches that improve inference throughput without retraining the original model from scratch.
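The capacity implied by the 500-millisecond, 40,000-requests-per-minute example can be estimated with Little's law (average requests in flight = arrival rate × time per request). This is a back-of-the-envelope sketch; the figure of 4 concurrent requests per inference replica is a hypothetical assumption, not a number from the scenario.

```python
import math

def required_concurrency(requests_per_minute, latency_seconds):
    """Little's law: average requests in flight = arrival rate * latency."""
    return requests_per_minute / 60.0 * latency_seconds

def replicas_needed(requests_per_minute, latency_seconds, concurrent_per_replica):
    """Replicas required to hold the average in-flight load, rounded up."""
    in_flight = required_concurrency(requests_per_minute, latency_seconds)
    return math.ceil(in_flight / concurrent_per_replica)

# 40,000 requests per minute at 0.5 seconds per inference.
in_flight = required_concurrency(40_000, 0.5)  # about 333 requests in flight

# Hypothetical assumption: each replica can process 4 requests concurrently.
replicas = replicas_needed(40_000, 0.5, concurrent_per_replica=4)  # 84 replicas
```

This is an average-load estimate only. Traffic spikes and tail latency would push the real requirement higher, which is why reducing per-inference latency often matters more than adding servers.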

In practice

What scaling looks like inside a working ad agency.

An agency trains a bidding value prediction model for a performance marketing client that estimates the expected conversion value for each ad auction opportunity to inform real-time bidding decisions. The input feature vector includes 18 features: conversion value from prior purchases (range: $0 to $4,200), days since last purchase (range: 0 to 365), number of prior purchases (range: 0 to 48), device type (one-hot encoded: 0 or 1 per category), browser type (one-hot encoded), time of day encoded as two cyclic features (range: -1 to 1), and normalized audience quality score (range: 0 to 1). Initial training of a neural network model with these raw features shows slow convergence and a training loss that oscillates without decreasing for the first 30 epochs. The agency runs a diagnostic check and identifies that conversion value (range $0 to $4,200) and days since purchase (range 0 to 365) are dominating the gradient updates because their magnitudes are hundreds to thousands of times larger than those of the one-hot and normalized features. Applying standardization (zero mean, unit variance) to all continuous features before training resolves the convergence issue: the model begins improving immediately and reaches a stable minimum within 25 epochs. Validation RMSE on conversion value prediction improves from $84 (unscaled) to $47 (scaled), a 44% improvement from the preprocessing fix alone. The agency documents the scaling step in the model pipeline code and adds an automated check that verifies feature ranges before training, so that scale-related issues in future model iterations are caught before they reach the training run.
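An automated range check of the kind the agency adds could look roughly like the following. This is a minimal sketch, not the agency's actual code: the function name, the feature names, the sample values, and the 10x threshold are all hypothetical.

```python
def check_feature_ranges(columns, max_ratio=10.0):
    """Return the names of features whose value range exceeds max_ratio
    times the smallest non-zero range among all features."""
    spans = {name: max(vals) - min(vals) for name, vals in columns.items()}
    nonzero = [s for s in spans.values() if s > 0]
    if not nonzero:
        return []
    smallest = min(nonzero)
    return sorted(name for name, s in spans.items() if s > max_ratio * smallest)

# Hypothetical sample rows covering the ranges described in the scenario.
columns = {
    "prior_conversion_value": [0.0, 310.0, 4200.0],
    "days_since_last_purchase": [0.0, 45.0, 365.0],
    "prior_purchases": [0.0, 3.0, 48.0],
    "audience_quality_score": [0.0, 0.4, 1.0],
    "device_is_mobile": [0.0, 1.0, 1.0],
}

flagged = check_feature_ranges(columns)
# flagged lists the three wide-range features that should be standardized.
```

A check like this can run as the first step of the training job and fail fast when any feature is flagged, so that an unscaled input never reaches the optimizer.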
