A hyperparameter that controls how large a step a model takes when updating its parameters during each iteration of gradient descent training. Too large a learning rate causes training to overshoot and diverge; too small a rate causes training to converge too slowly or stall in suboptimal solutions. Finding and scheduling the right learning rate is one of the most consequential decisions in training a deep learning model.
Also known as step size, training step size
During training, a model updates its parameters by computing the gradient of the loss function with respect to each parameter and taking a step in the direction that reduces the loss. The learning rate is the scalar multiplier that determines the size of each step. Because the gradient points in the direction of steepest increase in loss, the model moves in the opposite direction by an amount equal to the learning rate times the gradient. Under plain gradient descent, a learning rate of 0.01 means each parameter moves by 1% of its gradient value per update step; a learning rate of 0.001 means it moves by 0.1%. Adaptive optimizers such as Adam rescale the gradient before applying the learning rate, so the effective step differs, but the learning rate still sets the overall scale.
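The update rule described above can be sketched in a few lines of Python. This is a minimal illustration of plain gradient descent only; the function name `sgd_step` and the parameter and gradient values are made up for the example and do not come from any particular framework.

```python
def sgd_step(params, grads, learning_rate):
    """Return new parameters after one plain gradient descent update."""
    # Move each parameter against its gradient, scaled by the learning rate.
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Hypothetical two-parameter model with made-up gradient values.
params = [0.50, -1.20]
grads = [2.00, -4.00]

updated = sgd_step(params, grads, learning_rate=0.01)
# With a learning rate of 0.01, each parameter shifts by 1% of its gradient:
# 0.02 for the first parameter and 0.04 (in the opposite sign) for the second.
```

Real training frameworks apply the same idea to millions or billions of parameters at once, usually through an optimizer object rather than a hand-written loop.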
The learning rate exhibits a characteristic tradeoff. A learning rate that is too large causes the model to overshoot the loss minimum on each step: the parameters oscillate around the minimum or diverge entirely, and training fails to converge. A learning rate that is too small causes the model to take tiny steps that make negligible progress per update, making training impractically slow and increasing the risk of getting stuck in shallow local minima. The optimal learning rate sits in the range where steps are large enough to make rapid progress but small enough to converge reliably to a good solution.
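A toy experiment makes the tradeoff concrete. The sketch below minimizes the one-parameter loss f(w) = w² (an assumption chosen for simplicity, not a realistic model) with three learning rates. The specific rates 1.1, 0.001, and 0.3 are illustrative only; the values that count as "too large" or "too small" depend entirely on the loss being optimized.

```python
def run_gradient_descent(learning_rate, start=1.0, steps=50):
    """Minimize the toy loss f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * gradient   # plain gradient descent update
    return w


too_large = run_gradient_descent(1.1)     # overshoots the minimum on every step
too_small = run_gradient_descent(0.001)   # barely moves in 50 steps
reasonable = run_gradient_descent(0.3)    # ends very close to the minimum at w = 0
```

With the largest rate, each update flips the sign of `w` and increases its magnitude, so the run diverges. With the smallest rate, `w` is still above 0.9 after 50 steps. The middle rate lands essentially on the minimum.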
Learning rate schedules, which change the learning rate over the course of training, are standard practice in modern deep learning. Warmup schedules start with a small learning rate and increase it over the first few hundred or thousand training steps, preventing instability in early training when the model parameters are far from any reasonable solution. Decay schedules reduce the learning rate as training progresses, allowing the model to take large steps early when it is far from a good solution and small steps later when fine-grained refinement is needed. Cyclical learning rates alternate between high and low values, which can help the model escape shallow local minima where it might otherwise stay trapped under a flat or decaying schedule.
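Two of the schedule shapes described above, decay and cyclical, can be written as simple functions of the training step. The sketch below assumes an exponential decay and a triangular cycle; both are common choices but only two of many, and the function names and parameter values are hypothetical.

```python
def exponential_decay(step, base_lr, decay_rate, decay_steps):
    """Multiply the learning rate by decay_rate every decay_steps steps."""
    return base_lr * decay_rate ** (step / decay_steps)


def triangular_cyclical(step, min_lr, max_lr, half_cycle_steps):
    """Ramp linearly from min_lr up to max_lr and back down, repeatedly."""
    cycle_pos = step % (2 * half_cycle_steps)
    # 1.0 at the start and end of a cycle, 0.0 at the peak in the middle.
    distance_from_peak = abs(cycle_pos - half_cycle_steps) / half_cycle_steps
    return max_lr - (max_lr - min_lr) * distance_from_peak


# Illustrative values only.
decayed = [exponential_decay(s, 0.1, 0.5, 100) for s in (0, 100, 200)]
cycled = [triangular_cyclical(s, 1e-5, 1e-3, 50) for s in (0, 25, 50, 100)]
```

Here the decayed rate halves every 100 steps, while the cyclical rate climbs from its minimum to its maximum over 50 steps and returns to the minimum by step 100.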
A working ad agency fine-tuning language models on proprietary creative data, brand voice examples, or domain-specific content needs to treat learning rate selection as a first-class concern rather than an afterthought. Fine-tuning a pre-trained model with a learning rate that is too large destroys the knowledge encoded in the pre-trained weights, a problem called catastrophic forgetting, where the model’s performance on the general task degrades severely in exchange for minimal improvement on the fine-tuning task. Fine-tuning with a learning rate that is too small produces models that are indistinguishable from the base model because the fine-tuning signal is too weak to meaningfully shift the parameters.
Fine-tuning large language models for brand voice requires learning rates one to two orders of magnitude smaller than pre-training. Language models are typically pre-trained with learning rates in the range of 1e-4 to 1e-3. Fine-tuning those same models for a downstream task such as brand voice adaptation typically requires learning rates in the range of 1e-6 to 1e-5, which update the pre-trained representations in a targeted way without overwriting them. Using a pre-training learning rate for fine-tuning is a common mistake that produces models with degraded overall language quality, even if the fine-tuning loss initially decreases rapidly.
Learning rate warmup prevents early training instability in transformer fine-tuning. The standard recipe for fine-tuning transformer-based language and image models includes a linear warmup phase where the learning rate increases from zero to the target value over the first 5 to 10 percent of training steps, followed by a linear or cosine decay to near zero over the remaining steps. This warmup-decay schedule is empirically more stable than a flat learning rate for transformer architectures and is the default configuration in most fine-tuning frameworks. Agencies adapting open-source models should use the warmup schedule from the model card or fine-tuning documentation rather than experimenting with flat rates.
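The warmup-then-decay recipe above can be sketched as a single function of the training step. This is a minimal illustration that assumes a cosine decay and a 10 percent warmup; the function name `warmup_cosine_lr`, its parameters, and the example values are hypothetical and are not taken from any specific fine-tuning framework, whose built-in schedulers should be preferred in practice.

```python
import math


def warmup_cosine_lr(step, total_steps, peak_lr, warmup_fraction=0.1, final_lr=0.0):
    """Linear warmup from zero to peak_lr, then cosine decay to final_lr."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to the peak value.
        return peak_lr * step / warmup_steps
    # Decay phase: progress runs from 0.0 (end of warmup) to 1.0 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


# Illustrative run: 1,000 total steps with a peak learning rate of 1e-5.
schedule = [warmup_cosine_lr(s, 1000, 1e-5) for s in (0, 50, 100, 550, 1000)]
```

In this example the rate starts at zero, reaches half the peak at step 50, hits the peak at step 100, falls to half the peak midway through the decay phase at step 550, and ends near zero at step 1,000.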
Learning rate is the first hyperparameter to tune when a fine-tuning run underperforms. When a fine-tuned model fails to show expected improvement on the target task, learning rate is the most likely culprit and the least expensive to diagnose. Running three training runs at learning rates one order of magnitude apart, such as 1e-4, 1e-5, and 1e-6, and comparing validation performance after a small number of steps takes a few hours of compute and almost always identifies whether the learning rate is too high, too low, or in the right range. This simple sweep prevents wasting days of compute on a full training run with a misconfigured learning rate.
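The sweep described above amounts to running the same short job at each candidate rate and comparing the results. In the sketch below, `toy_short_run` is a deliberately simple stand-in for a real short fine-tuning run: it performs gradient descent on a steep one-parameter toy loss and returns the final loss. The curvature constant, step count, and function names are all assumptions made for illustration; a real sweep would replace `toy_short_run` with a call that trains briefly and returns validation loss.

```python
def toy_short_run(learning_rate, steps=20, curvature=2.5e4):
    """Stand-in for a short training run; returns a final 'validation loss'."""
    w = 1.0
    for _ in range(steps):
        gradient = curvature * w          # derivative of 0.5 * curvature * w**2
        w = w - learning_rate * gradient
    return 0.5 * curvature * w ** 2


def sweep_learning_rates(candidates, short_run):
    """Run short_run once per candidate and return (best_lr, all_results)."""
    results = {lr: short_run(lr) for lr in candidates}
    best_lr = min(results, key=results.get)   # lowest loss wins
    return best_lr, results


best_lr, results = sweep_learning_rates([1e-4, 1e-5, 1e-6], toy_short_run)
```

On this toy loss the largest rate diverges, the smallest barely reduces the loss, and the middle rate performs best. A real sweep may show a different pattern, which is the point of running it: if the best result sits at one end of the range, the sweep should be extended in that direction.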
An agency is fine-tuning an open-source language model to generate product description copy that matches a luxury retail client’s distinctive brand voice. The client has provided 400 examples of approved copy from their internal creative team. The initial fine-tuning run uses the default learning rate from the base model’s pre-training configuration, 2e-4, and trains for three epochs. The resulting model generates text that superficially matches the training examples but degrades rapidly in coherence on longer outputs and occasionally produces text with uncharacteristic grammatical structures that were not present in either the base model or the training examples. This is a symptom of using too high a learning rate, which has partially overwritten the base model’s language structure representations. The team reduces the learning rate to 5e-6 with a 100-step linear warmup and trains for five epochs on the same data. The resulting model generates copy that preserves the base model’s language quality and coherence while adopting the client’s preferred vocabulary, sentence rhythm, and tonal register. Validation on held-out examples shows 71% of outputs rated as “on-brand” by the client’s creative director, compared to 34% for the initial run. The learning rate change, requiring no additional data or architectural modifications, is the single change responsible for the performance difference.
The generative AI foundations module covers how neural network training works, including the learning rate and schedule choices that separate successful fine-tuning projects from expensive failures.