AI Glossary · Letter S

Stochastic Gradient Descent.

An optimization algorithm that trains machine learning models by iteratively updating model parameters in the direction that reduces the loss function, using gradient estimates computed from small random subsets (mini-batches) of the training data rather than the full dataset. Stochastic gradient descent and its variants (Adam, RMSProp, AdaGrad) are the standard optimization algorithms for training deep learning models and are the engine underlying nearly every neural network trained for marketing AI applications.

Also known as SGD, mini-batch gradient descent, online gradient descent

What it is

A working definition of stochastic gradient descent.

Gradient descent updates model parameters by computing the gradient of the loss function with respect to the parameters (the direction of steepest increase in loss) and taking a small step in the opposite direction (toward lower loss). Full-batch gradient descent computes the gradient over the entire training set at each update step, which is exact but prohibitively expensive for large datasets. In its strict form, stochastic gradient descent approximates the true gradient using a single randomly selected training example at each step, which is cheap but introduces high variance in the gradient estimates. Mini-batch SGD, the standard in practice, strikes a balance by computing gradients over small batches, typically 32 to 512 examples, providing gradient estimates that are noisy enough to help escape poor local optima and saddle points but stable enough to converge reliably.
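The update rule described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production training loop: the function name `minibatch_sgd`, the one-dimensional linear model, the synthetic data, and every hyperparameter value are assumptions chosen to keep the example small.

```python
import random

def minibatch_sgd(xs, ys, lr=0.05, batch_size=32, epochs=50, seed=0):
    """Fit y = w*x + b by mini-batch SGD on mean squared error (toy sketch)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # draw different random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * err * xs[i]
                grad_b += 2.0 * err
            grad_w /= len(batch)
            grad_b /= len(batch)
            # Step against the gradient, toward lower loss.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic data (made up for illustration): y = 3x + 1 plus a little noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(512)]
ys = [3.0 * x + 1.0 + data_rng.gauss(0.0, 0.1) for x in xs]
w, b = minibatch_sgd(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

Setting `batch_size=1` gives the single-example form of SGD, and setting it to `len(xs)` gives full-batch gradient descent, so the same loop covers all three variants discussed above.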

The learning rate is the most important hyperparameter for SGD-based optimization. Too large a learning rate causes divergence or oscillation: the parameter updates overshoot the minimum and bounce between high-loss regions. Too small a learning rate causes slow convergence: the model makes tiny progress per update, requiring many more training steps to reach a good solution. Learning rate schedules that reduce the learning rate during training (step decay, cosine annealing, warmup followed by decay) typically produce better final models than fixed learning rates, because large learning rates early in training allow rapid progress toward a good region of parameter space while small learning rates later allow fine-grained convergence within that region.
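A warmup-then-decay schedule of the kind mentioned above can be written as a small function of the step number. The sketch below assumes linear warmup followed by cosine annealing; the name `lr_at_step`, the peak rate of 2e-5, and the 10% warmup fraction are illustrative choices, not recommendations from any particular framework.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.1, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (0-based step)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up linearly so early updates stay small.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to just under 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at_step(s, total_steps=1000) for s in range(1000)]
print(schedule[0], schedule[99], schedule[500], schedule[-1])
```

The rate rises during the first 100 steps, peaks, and then falls smoothly toward zero, which matches the pattern of fast early progress followed by fine-grained convergence.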

Adam (Adaptive Moment Estimation) and its variants have largely replaced vanilla SGD for training deep networks in most applications. Adam maintains per-parameter adaptive step sizes based on running estimates of the first moment (mean) and second moment (uncentered variance, the mean of squared gradients) of the gradients. Each update is the first-moment estimate divided by the square root of the second-moment estimate, so parameters with consistently large gradients take smaller effective steps per unit of gradient, parameters with small or infrequent gradients take relatively larger ones, and parameters whose gradients are noisy and frequently change sign take smaller steps than parameters whose gradients point consistently in one direction. This adaptivity makes Adam less sensitive to the global learning rate setting than vanilla SGD, so it converges reliably across a wider range of learning rate values. AdamW modifies Adam to apply weight decay directly to the weights (decoupled weight decay) rather than folding it into the adaptive gradient estimates, and is the standard optimizer for training large transformer models.
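To make the moment estimates concrete, here is a minimal sketch of a single Adam update in plain Python. The function name `adam_step` and the list-based parameter layout are assumptions for illustration; the default values for `lr`, `beta1`, `beta2`, and `eps` are the ones commonly quoted for Adam. It omits weight decay, so it illustrates Adam rather than AdamW.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place; t is the 1-based step count."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g       # running mean of gradients
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g   # running mean of squared gradients
        m_hat = m[i] / (1.0 - beta1 ** t)             # bias correction for the zero init
        v_hat = v[i] / (1.0 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

# Two parameters whose raw gradients differ by a factor of 1000.
params = [0.0, 0.0]
m, v = [0.0, 0.0], [0.0, 0.0]
adam_step(params, grads=[10.0, 0.01], m=m, v=v, t=1)
print(params)
```

Both parameters move by roughly the same amount (about `lr`) even though their gradients differ by three orders of magnitude, which is the per-parameter normalization that makes Adam less sensitive to the global learning rate.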

Why ad agencies care

Why optimization algorithm choice and learning rate settings affect the quality and cost of training AI models for agency applications.

A working ad agency that fine-tunes pre-trained models for client-specific tasks, trains custom scoring models on client data, or manages AI vendor relationships where training cost is a factor needs to understand optimization well enough to make informed decisions about these processes. Training cost is a direct function of optimization efficiency: a well-tuned optimizer with an appropriate learning rate schedule converges in fewer steps, reducing GPU time and therefore training cost. A poorly configured optimizer that requires 3 times as many steps to converge costs roughly 3 times as much to train, and may produce a worse final model if the budget runs out before convergence.

Learning rate is the single hyperparameter most likely to cause training failure and the first thing to diagnose when a model is not learning. A training loss that does not decrease across epochs most often indicates a learning rate that is too large (oscillating or diverging loss) or too small (parameter updates too small to change the loss noticeably). Learning rate range tests, which train a model briefly with an exponentially increasing learning rate and identify the range where loss decreases fastest, provide a principled starting point for learning rate selection rather than relying on manual tuning. For most agency fine-tuning tasks on pre-trained models, learning rates in the range of 1e-5 to 1e-4 with a warmup schedule are reliable starting points, with rates toward the lower end of that range preferred for large pre-trained models to avoid destroying pre-trained representations.
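A learning rate range test can be sketched as a short harness around whatever performs one training step. In the example below, `lr_range_test`, the `blowup` stopping threshold, and the toy quadratic loss in `toy_step` are all hypothetical stand-ins; in a real run, `train_step` would process one mini-batch at the given rate and return its loss.

```python
def lr_range_test(train_step, min_lr=1e-6, max_lr=1.0, num_steps=100, blowup=4.0):
    """Call train_step(lr) -> loss with an exponentially increasing lr.

    Stops early once the loss exceeds `blowup` times the best loss seen.
    Returns a list of (lr, loss) pairs for inspection or plotting.
    """
    ratio = (max_lr / min_lr) ** (1.0 / (num_steps - 1))
    history = []
    best = float("inf")
    lr = min_lr
    for _ in range(num_steps):
        loss = train_step(lr)
        history.append((lr, loss))
        best = min(best, loss)
        if loss > blowup * best:
            break  # loss is diverging, so larger rates are not worth testing
        lr *= ratio
    return history

# Toy stand-in for a model: loss = 0.5 * w^2, whose gradient is w.
state = {"w": 5.0}

def toy_step(lr):
    loss = 0.5 * state["w"] ** 2   # loss before this update
    state["w"] -= lr * state["w"]  # one gradient descent step
    return loss

history = lr_range_test(toy_step, min_lr=1e-3, max_lr=100.0, num_steps=100)
print(len(history), history[-1])
```

On this toy loss, updates only start increasing the loss once the rate passes 2.0, so the test stops shortly after that point. On a real model the recorded curve is usually plotted, and a rate somewhat below the point where the loss starts rising is taken as the starting value.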

Batch size selection for mini-batch SGD affects both training speed and generalization quality. Larger batches (512 to 4096 examples) provide more accurate gradient estimates and can exploit parallel GPU computation more efficiently, producing faster wall-clock training time. However, large-batch training tends to converge to sharper minima that generalize worse than the flatter minima found by small-batch training, an empirical observation often called the large-batch generalization gap. For most agency fine-tuning tasks on modest-sized datasets, batch sizes of 16 to 64 are standard. When batch size is increased for computational efficiency, the learning rate is usually scaled up with it, either linearly with the batch-size ratio or with its square root.
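The two scaling rules are simple arithmetic. The helper below is a sketch under the assumption that a learning rate has already been tuned at some base batch size; the function name `scale_learning_rate` and the example values are illustrative only.

```python
import math

def scale_learning_rate(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Rescale a tuned learning rate for a different batch size."""
    factor = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * factor
    if rule == "sqrt":
        return base_lr * math.sqrt(factor)
    raise ValueError(f"unknown rule: {rule!r}")

# A rate tuned at batch size 16, rescaled for batch size 64 (4x larger).
linear_lr = scale_learning_rate(2e-5, 16, 64, rule="linear")  # 4x the base rate
sqrt_lr = scale_learning_rate(2e-5, 16, 64, rule="sqrt")      # 2x the base rate
print(linear_lr, sqrt_lr)
```

Either result is a starting point rather than a guarantee; the scaled rate is still worth checking against the training loss curve.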

Gradient clipping prevents training instability from exploding gradients in recurrent and transformer models trained on long sequences. Transformer models training on long document sequences or multi-turn conversation examples occasionally produce extremely large gradient magnitudes for specific parameter updates, which cause the loss to spike catastrophically. Gradient clipping, which rescales the gradient vector when its L2 norm exceeds a threshold (typically 1.0 or 5.0), prevents these individual large updates from disrupting the learned representations accumulated over many prior training steps. Clipping is a standard component of transformer training recipes and is enabled by default in some high-level training libraries, while lower-level frameworks usually require it to be added explicitly. Omitting it for long-sequence transformer training often causes periodic loss spikes that are difficult to diagnose without knowing to look for exploding gradients.
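Clipping by L2 norm amounts to one rescaling step before the optimizer update. The sketch below treats the gradient as a flat list of floats; the function name `clip_by_global_norm` is an assumption for illustration, and real frameworks apply the same idea across all parameter tensors at once.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns (clipped_grads, norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm  # already within the threshold
    scale = max_norm / total_norm
    # Every component shrinks by the same factor, so the direction is unchanged.
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped, norm_before)
```

Logging the norm before clipping is a cheap way to spot exploding gradients. In PyTorch, the corresponding built-in utility is `torch.nn.utils.clip_grad_norm_`.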

In practice

What stochastic gradient descent looks like inside a working ad agency.

An agency is fine-tuning a pre-trained language model on a client’s 5-year archive of 4,200 approved ad copy examples to create a brand voice model for automated copy pre-screening and generation. The training setup uses a 125-million-parameter pre-trained model, a batch size of 16, and the AdamW optimizer with default hyperparameters. Initial training at the default learning rate of 1e-4 shows a training loss that oscillates and fails to decrease below its initial value across 3 epochs, a pattern consistent with a learning rate that is too high for fine-tuning a pre-trained model (the gradient updates are overwriting pre-trained representations before the model can adapt).

The agency reduces the learning rate to 2e-5 with a linear warmup over the first 10% of training steps. Training loss decreases smoothly across 10 epochs. Validation loss on a held-out 420-example set reaches its minimum at epoch 7 and increases slightly in epochs 8 through 10, indicating the beginning of overfitting. The agency adds early stopping with a patience of 2 epochs, keeping the checkpoint with the lowest validation loss, and applies gradient clipping at norm 1.0. Final model: the epoch-7 checkpoint, trained at a 2e-5 learning rate with warmup, batch size 16, AdamW, gradient clipping at 1.0.

Brand voice alignment evaluation by the client’s creative team on 100 held-out examples: 87% rated as on-brand, compared to 58% for the same model before fine-tuning and 73% for a rule-based brand voice filter previously in use. Fine-tuning cost: 3.5 GPU-hours at a cloud provider rate of $0.85 per GPU-hour, for a total training cost of $2.98. Together, the optimization configuration decisions (learning rate reduction, warmup, early stopping, gradient clipping) reduced total training time by 40% and produced a model that scores 29 percentage points above the un-fine-tuned baseline and 14 percentage points above the rule-based filter.
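The early stopping rule in this scenario can be sketched as a small function over per-epoch validation losses. The loss values below are invented for illustration; they only mimic the shape described above (minimum at epoch 7, slight rise afterward) and are not results from the agency's run. The function name `early_stopping_epoch` is likewise hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops after `patience` consecutive epochs without a new best
    validation loss; the checkpoint from best_epoch is the one to keep.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Illustrative validation losses for epochs 1 through 10 (made-up numbers).
val_losses = [2.10, 1.80, 1.62, 1.51, 1.44, 1.40, 1.38, 1.39, 1.41, 1.43]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(best_epoch, stop_epoch)
```

With a patience of 2, training halts after epoch 9 and the epoch-7 checkpoint is kept, so the final epoch of the original 10-epoch run is never paid for.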

