AI Glossary · Letter G

Gradient Descent.

The iterative optimization algorithm that trains nearly every neural network and many other machine learning models by repeatedly computing the gradient of the loss function with respect to the model’s weights and updating the weights in the direction that reduces the loss. Gradient descent is not a detail of model internals; it is the mechanism by which AI models learn from data, and understanding its behavior explains why model training produces the results it does.

Also known as gradient-based optimization, steepest descent, first-order optimization

What it is

A working definition of gradient descent.

Gradient descent treats model training as a search over a high-dimensional parameter space for the combination of weights that minimizes the loss function. At each training step, the algorithm computes the gradient: a vector indicating how much the loss would increase if each weight were increased slightly. The algorithm then moves the weights in the opposite direction of the gradient, a step called the weight update, scaled by the learning rate hyperparameter. Repeating this process across many training examples causes the model to progressively reduce its loss and improve its predictions. The name “descent” refers to moving downhill on the surface of the loss function, analogous to water flowing toward the lowest point of a landscape.

Pure gradient descent computes the gradient across the entire training dataset before making each update, which is computationally expensive and rarely used in practice. Stochastic gradient descent computes the gradient from a single randomly selected training example per update, which is fast but produces noisy gradient estimates. Mini-batch gradient descent, the standard approach, computes gradients over a small batch of training examples, balancing computational efficiency against gradient estimate quality. The batch size is a hyperparameter that affects training stability, convergence speed, and final model quality, and is typically set between 16 and 512 depending on the model architecture and available GPU memory.

Modern training uses adaptive variants of gradient descent that adjust the learning rate per-parameter based on the history of past gradients. Adam (Adaptive Moment Estimation) is the most widely used optimizer for deep learning: it maintains running estimates of the first and second moments of the gradient for each parameter and uses these to normalize the update step, effectively giving each parameter its own adaptive learning rate. These adaptive methods reduce the sensitivity of training to the initial learning rate choice and converge faster and more reliably than vanilla gradient descent on the non-convex loss landscapes of deep neural networks.

Why ad agencies care

Why gradient descent might matter more in agency work than in most industries.

Every time a vendor presents a trained AI model, every time an agency fine-tunes a foundation model for a client application, and every time a platform’s bidding algorithm optimizes toward a campaign objective, gradient descent is the mechanism that produced the learned behavior. A working ad agency that understands gradient descent can evaluate training quality, diagnose optimization failures, and make informed decisions about learning rate configuration, batch size, and optimizer choice when building or validating custom models.

Learning rate is the most consequential training hyperparameter. A learning rate that is too large causes the optimizer to overshoot the minimum on every step, preventing convergence. A learning rate that is too small causes training to converge so slowly that the model never reaches good performance within the available training budget. Learning rate schedules that start with a larger rate for rapid initial progress and reduce it as training converges have become standard practice. When a custom model is underperforming, learning rate misconfiguration is one of the first things to check: most training failures are not architectural but optimization failures that learning rate adjustment would fix.

Gradient vanishing explains why deep networks are difficult to train without modern techniques. In networks with many layers, the gradient signal is multiplied by weight matrices as it propagates back through the network during training. If these multiplications repeatedly shrink the signal, the gradient reaching the early layers is near zero and those layers learn almost nothing. This is the vanishing gradient problem that made training deep networks impractical before techniques like residual connections, batch normalization, and careful weight initialization became standard. Understanding this explains why these techniques appear in almost every modern architecture and why their absence in a custom architecture is a warning sign.

Loss curves are the primary diagnostic for training quality. A training loss curve that decreases smoothly indicates that gradient descent is converging normally. A curve that fluctuates wildly indicates a learning rate that is too large. A curve that plateaus early indicates underfitting, inadequate model capacity, or a learning rate that is too small. A training loss that decreases while validation loss increases indicates overfitting. Reading loss curves to diagnose training behavior is a basic quality check that any agency running model training should perform before deploying a model to production.

In practice

What gradient descent looks like inside a working ad agency.

An agency is fine-tuning a text classification model to categorize customer support tickets for a retail client into 12 intent categories. The initial training run uses the default learning rate of 5e-5 from the pre-trained model’s documentation. After 3 epochs, training loss has decreased from 2.8 to 1.4, but validation loss has stopped decreasing and is slightly increasing, suggesting the model is beginning to overfit. The agency reduces the learning rate to 2e-5 and adds a cosine learning rate decay schedule that reduces the rate further over the remaining epochs. With this adjustment, the model continues improving on the validation set through 6 epochs, reaching a final validation accuracy of 91% compared to 84% with the original learning rate and training duration. The learning rate reduction and schedule change, a 20-minute configuration change, accounts for a 7-point accuracy improvement that would otherwise have required architectural changes or more training data.

Gradient Descent.

A working definition of gradient descent.

Why gradient descent might matter more in agency work than in most industries.

What gradient descent looks like inside a working ad agency.

Build the training process literacy that catches optimization failures before they become deployment problems through The Creative Cadence Workshop.

Concepts in gradient descent’s territory.