A point in a model’s loss landscape where the training loss is lower than all immediately neighboring points but is not the globally lowest loss achievable. Getting stuck in a local minimum is a classic failure mode in neural network training, though modern large models navigate complex loss landscapes in ways that make simple local minima less problematic in practice.
Also known as local optimum, suboptimal solution. Often confused with, but distinct from, a saddle point
A local minimum is a point in an optimization problem where the objective function—in machine learning, the loss function—is lower than all immediately adjacent points but is not the global minimum: the lowest possible value of the objective function across the entire parameter space. Gradient descent, the standard training algorithm for neural networks, moves parameters in the direction that decreases the loss and stops moving when the gradient reaches zero. A local minimum is a point where the gradient is zero but the loss is higher than it could be, causing gradient descent to converge to a suboptimal solution.
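The behavior described above can be illustrated with a minimal sketch. The one-dimensional function below is a made-up stand-in for a loss function, not one taken from any real model; it has a deep valley near x = -1.30 and a shallower one near x = 1.13. The function names, learning rate, and step count are illustrative assumptions.

```python
def loss(x):
    # Toy 1-D "loss" (an assumption for illustration) with two valleys:
    # a deep one near x = -1.30 and a shallower one near x = 1.13.
    return x**4 - 3 * x**2 + x


def grad(x):
    # Derivative of the toy loss above.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, learning_rate=0.01, steps=1000):
    # Repeatedly step against the gradient, starting from x0.
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


x_left = gradient_descent(-2.0)   # starts on the left slope
x_right = gradient_descent(2.0)   # starts on the right slope

print(f"start -2.0 -> x = {x_left:.3f}, loss = {loss(x_left):.3f}")
print(f"start  2.0 -> x = {x_right:.3f}, loss = {loss(x_right):.3f}")
```

Both runs stop where the gradient is essentially zero, but the run that starts at 2.0 settles in the shallower valley with a noticeably higher loss. The only difference between the two runs is the starting point.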
For small, shallow neural networks, local minima were historically a significant problem: models would train to different solutions depending on initialization, and some runs would converge to poor local minima. Modern large neural networks train in very high-dimensional parameter spaces where the geometry of the loss landscape is qualitatively different. Research has found that most critical points in high-dimensional loss landscapes are saddle points rather than true local minima, and that the local minima that do exist in such landscapes tend to have similar loss values to the global minimum. This is one reason large language models train reliably despite having billions of parameters.
Related concepts include saddle points (where the gradient is zero but the point is neither a local minimum nor a local maximum, because the loss curves upward in some directions and downward in others) and plateaus, flat regions where the gradient is nearly zero and training slows dramatically. Modern optimizers such as Adam address these issues through momentum and adaptive learning rates, which help training move through flat regions and away from saddle points.
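A small sketch can make the difference between a saddle point and a local minimum concrete. It uses a made-up two-parameter loss, f(x, y) = x**2 + y**4 / 4 - y**2 / 2, chosen only for illustration, and classifies critical points by the signs of the eigenvalues of the Hessian (the matrix of second derivatives). All names and settings here are assumptions. It uses plain gradient descent rather than Adam, to keep the example short.

```python
import math


def classify_critical_point(hxx, hxy, hyy):
    # Eigenvalues of the symmetric 2x2 Hessian [[hxx, hxy], [hxy, hyy]].
    mean = (hxx + hyy) / 2
    radius = math.sqrt(((hxx - hyy) / 2) ** 2 + hxy**2)
    low, high = mean - radius, mean + radius
    if low > 0:
        return "local minimum"   # curves upward in every direction
    if high < 0:
        return "local maximum"   # curves downward in every direction
    if low < 0 < high:
        return "saddle point"    # up in one direction, down in another
    return "degenerate (at least one flat direction)"


def grad(x, y):
    # Gradient of the toy loss f(x, y) = x**2 + y**4 / 4 - y**2 / 2.
    return 2 * x, y**3 - y


def descend(x, y, learning_rate=0.1, steps=500):
    # Plain gradient descent, with no momentum and no noise.
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - learning_rate * gx, y - learning_rate * gy
    return x, y


# The Hessian of f is [[2, 0], [0, 3 * y**2 - 1]].
print(classify_critical_point(2, 0, -1))  # Hessian at (0, 0)
print(classify_critical_point(2, 0, 2))   # Hessian at (0, 1)

stuck = descend(1.0, 0.0)      # starts exactly on the ridge y = 0
escaped = descend(1.0, 1e-6)   # same start with a tiny nudge in y
print(stuck, escaped)
```

The run that starts exactly on the ridge converges to the saddle at (0, 0), while the run with a tiny nudge drifts away from it and ends at the minimum near (0, 1). This hints at why the noise in mini-batch training and the momentum in modern optimizers tend to keep training from stalling at saddle points.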
Optimization behavior affects model reliability in ways agencies observe directly. A model that trained to a poor local minimum will produce systematically suboptimal outputs. Multiple training runs on the same data that produce substantially different results signal an unstable optimization landscape. Understanding that these phenomena have a specific technical cause—optimization failure rather than data failure—helps agencies distinguish between problems that can be fixed by retraining and problems that require different data or architecture.
Model reliability across retraining runs matters for agency AI infrastructure. When an agency or a platform vendor retrains a model on new data, the new model should produce similar outputs to the old model for the same inputs—unless the new data meaningfully changes what the model should output. Significant output differences between retrained model versions can signal optimization instability: the two training runs converged to different local minima. Agencies building AI systems that will be retrained periodically should test for this instability and use techniques—such as consistent initialization, learning rate schedules, and multiple training runs—that reduce it.
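One simple way to test for this kind of instability is to score the old and new model versions on the same held-out inputs and measure how often their outputs match. The sketch below is a hypothetical check, not a standard tool: the function name, the sample labels, and the 0.95 threshold are all assumptions made for illustration.

```python
def output_agreement(old_outputs, new_outputs):
    # Fraction of inputs on which the two model versions give the same output.
    if len(old_outputs) != len(new_outputs):
        raise ValueError("both versions must be scored on the same inputs")
    if not old_outputs:
        raise ValueError("need at least one input to compare")
    matches = sum(old == new for old, new in zip(old_outputs, new_outputs))
    return matches / len(old_outputs)


# Hypothetical labels from two model versions on the same five inputs.
old_version = ["loyal", "deal", "loyal", "occasional", "deal"]
new_version = ["loyal", "deal", "occasional", "occasional", "deal"]

AGREEMENT_THRESHOLD = 0.95  # assumed cutoff; choose one that fits the use case

agreement = output_agreement(old_version, new_version)
needs_review = agreement < AGREEMENT_THRESHOLD
print(f"agreement = {agreement:.2f}, needs review: {needs_review}")
```

A low agreement score does not by itself prove that the two runs converged to different local minima, since the new data may legitimately change the outputs. It does flag the retrained version for a closer look before it replaces the old one.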
Hyperparameter choices affect which local minima a model finds. The learning rate, batch size, and optimizer settings all influence which part of the loss landscape gradient descent explores and where it settles. A learning rate that is too high causes the optimizer to overshoot good solutions or fail to converge at all; one that is too low makes progress very slowly and tends to settle into the nearest minimum, even a poor one. Understanding that model performance depends on these choices, and not just on the data and architecture, is part of understanding why careful hyperparameter tuning matters.
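The effect of the learning rate can be seen even on the simplest possible loss. The sketch below minimizes f(x) = x**2, whose only minimum is at x = 0, with three learning rates. The function name and the specific rates are illustrative assumptions, and a single-valley function like this shows overshooting and slow progress but not the choice between competing minima.

```python
def run_gradient_descent(learning_rate, x0=1.0, steps=50):
    # Minimize f(x) = x**2, whose gradient is 2 * x and whose minimum is x = 0.
    x = x0
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x


x_too_high = run_gradient_descent(1.1)    # each step overshoots and grows
x_moderate = run_gradient_descent(0.1)    # shrinks steadily toward 0
x_too_low = run_gradient_descent(0.001)   # barely moves in 50 steps

print(f"lr = 1.1   -> x = {x_too_high:.3g}")
print(f"lr = 0.1   -> x = {x_moderate:.3g}")
print(f"lr = 0.001 -> x = {x_too_low:.3g}")
```

With the highest rate every update jumps past the minimum and lands farther away than it started, so the value grows instead of shrinking. With the lowest rate the parameter is still close to its starting value after 50 steps.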
An agency’s technology team trains a custom audience clustering model to segment a retail client’s customer base for personalized campaign targeting. They train the model three times on the same dataset with different random initializations and compare the resulting segmentations. Two of the three training runs produce similar cluster structures that align with the client’s intuitive understanding of their customer base: a deal-seeker segment, a brand-loyal segment, and an occasional-buyer segment. The third training run produces a qualitatively different segmentation where two of the clusters are almost indistinguishable by behavioral characteristics, suggesting the optimizer found a local minimum corresponding to a less useful partition of the data. The team discards the third run and uses the consistent segmentation from the other two, recognizing that the initialization sensitivity is a signal to use ensemble methods—training multiple models and combining their outputs—for production deployments where segmentation consistency across time matters.
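The comparison between runs in this example can be made quantitative. The source does not say how the team compared segmentations; one common option, sketched here as an assumption, is the Rand index, which measures how often two clusterings agree on whether a pair of customers belongs together. It ignores the arbitrary ID numbers that each run assigns to its clusters. The customer labels below are invented for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    # Fraction of customer pairs that both runs treat the same way:
    # either together in one cluster in both, or apart in both.
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same customers")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two customers")
    agreeing = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreeing / len(pairs)


# Hypothetical cluster IDs for six customers from three training runs.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]  # same grouping as run_1, different ID numbers
run_3 = [0, 0, 0, 1, 1, 1]  # a genuinely different grouping

print(rand_index(run_1, run_2))
print(rand_index(run_1, run_3))
```

The first comparison scores 1.0 because the two runs group customers identically, while the second scores lower because run_3 splits the customers differently. This version checks every pair, so on a large customer base it would need to run on a sample. In practice a chance-corrected variant such as the adjusted Rand index is usually preferred.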
The workshop covers gradient descent, optimization failure modes, and how to evaluate AI systems for the reliability properties that matter in production agency workflows.