AI Glossary · Letter L

Long Short-Term Memory.

A type of recurrent neural network architecture that uses learned gating mechanisms to selectively retain and update information across long sequences, mitigating the vanishing gradient problem that prevents simple recurrent networks from learning long-range dependencies. LSTMs were the dominant architecture for sequence modeling tasks including language modeling, time-series forecasting, and speech recognition before transformer models superseded them for most applications.

Also known as LSTM, LSTM network, LSTM cell

What it is

A working definition of the LSTM.

An LSTM processes sequential data by maintaining two state vectors that carry information across time steps: a hidden state, which serves as the network’s output at each step, and a cell state, which acts as a long-term memory that can persist information across many time steps. At each time step, the LSTM uses three learned gates to determine how to update these states. The forget gate decides how much of the previous cell state to retain. The input gate decides how much new candidate information from the current time step to write into the cell state. The output gate decides how much of the updated cell state to expose as the hidden state. Each gate is a sigmoid function, with learned weights, of the current input and the previous hidden state, so its values lie between 0 and 1 and act as soft switches. The network thereby learns which information to retain, which to write, and which to output based on the patterns in its training data.
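The gate description above can be sketched as a single time step in NumPy. This is a minimal illustration, not a production implementation: the function name `lstm_step`, the stacked weight layout (forget, input, candidate, output), and the random weights are all assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """Run one LSTM time step and return the new hidden and cell states.

    W has shape (4 * hidden, input + hidden) and b has shape (4 * hidden,).
    The four row blocks are ordered forget, input, candidate, output;
    that ordering is an arbitrary choice for this sketch.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0:hidden])                 # forget gate: how much of c_prev to keep
    i = sigmoid(z[hidden:2 * hidden])        # input gate: how much new content to write
    g = np.tanh(z[2 * hidden:3 * hidden])    # candidate values to write
    o = sigmoid(z[3 * hidden:4 * hidden])    # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g                 # additive cell state update
    h_t = o * np.tanh(c_t)                   # hidden state, also the step's output
    return h_t, c_t

# Run a short random sequence through the cell (illustrative values only).
rng = np.random.default_rng(0)
input_size, hidden_size, steps = 3, 4, 5
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(steps, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)
```

In a trained network, `W` and `b` would be learned by backpropagation rather than drawn at random. Deep learning libraries provide optimized versions of this step.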

The cell state provides a gradient highway that lets gradients flow backward through long sequences with far less attenuation, mitigating the core problem that makes simple recurrent networks unable to learn dependencies spanning many time steps. In a simple recurrent network, backpropagation through time multiplies the gradient at every step by the recurrent weight matrix and the derivative of the activation function, so over long sequences the gradient typically shrinks exponentially (or, when the weights are large, explodes). The LSTM cell state is updated additively, and the gradient along it is scaled at each step only by the forget gate. When the network learns to keep the forget gate close to 1, the gradient passes through nearly unchanged, enabling the network to preserve task-relevant information across sequences of hundreds of time steps.
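A rough numeric sketch shows why repeated multiplication matters. It collapses each backward step to a single scalar factor, which is a deliberate simplification. The values 0.9 (standing in for a simple recurrent network's per-step factor) and 0.99 (standing in for a forget gate held near 1) are assumed for illustration, not measured from any real model.

```python
def surviving_gradient(per_step_factor, steps):
    """Scale left on a gradient after flowing back through `steps` time steps,
    assuming the same scalar factor applies at every step."""
    gradient = 1.0
    for _ in range(steps):
        gradient *= per_step_factor
    return gradient

steps = 100
simple_rnn = surviving_gradient(0.9, steps)   # assumed per-step factor for a simple RNN
lstm_cell = surviving_gradient(0.99, steps)   # assumed forget gate value near 1

print(f"simple RNN path: {simple_rnn:.1e}")
print(f"LSTM cell path:  {lstm_cell:.2f}")
```

Under these assumptions, the first path leaves almost no training signal after 100 steps, while the second retains a usable fraction of it.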

LSTMs were the state-of-the-art architecture for natural language processing, time-series forecasting, and sequence-to-sequence tasks from roughly 2015 to 2018. Transformer architectures, which use attention mechanisms instead of recurrent computation, have largely replaced LSTMs for language tasks because transformers can process all time steps in parallel, enabling much faster training on modern GPU hardware. However, LSTMs retain practical advantages in applications with strict latency requirements, limited compute budgets, or sequential data with explicit local structure, and they remain widely deployed in production systems built before the transformer era.

Why ad agencies care

Why understanding LSTMs helps agencies evaluate and contextualize the AI systems they work with.

A working ad agency evaluating AI vendor claims, building time-series forecasting models, or interpreting why a legacy AI system behaves as it does will encounter LSTMs in contexts where understanding their architecture and limitations is directly useful. Many production systems in marketing technology, including bid prediction models, user behavior prediction systems, and demand forecasting tools, were built with LSTM architectures during the peak of their adoption and have not been replaced. Understanding what LSTMs are good at and where they struggle helps agencies interpret the behavior of these systems and evaluate upgrade proposals intelligently.

Campaign pacing and bid prediction systems built with LSTMs have specific failure modes at sequence boundaries. LSTM-based bid prediction models that forecast user conversion probability from click-stream sequences perform well in the middle of established user journeys but are less reliable at the beginning of new sessions, when the network’s hidden and cell states have been reset and it has not yet accumulated enough context to make confident predictions. Agencies evaluating these systems should test performance segmented by session position, not just average performance across all positions, to understand where the model is most and least reliable.
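One way to run the segmented test described above is to bucket predictions by their position in the session and score each bucket separately. The sketch below uses the Brier score (mean squared error of the predicted probability). The bucket boundaries, the record layout, and the sample records are hypothetical and would need to match the system under evaluation.

```python
from collections import defaultdict

def position_bucket(position):
    """Map a 1-based event position within a session to a coarse bucket."""
    if position <= 1:
        return "first event"
    if position <= 5:
        return "events 2-5"
    return "events 6+"

def brier_by_position(records):
    """Return the mean squared error of predicted probabilities per bucket.

    Each record is (position, predicted_probability, converted),
    where converted is 0 or 1.
    """
    squared_errors = defaultdict(list)
    for position, predicted, converted in records:
        bucket = position_bucket(position)
        squared_errors[bucket].append((predicted - converted) ** 2)
    return {bucket: sum(errs) / len(errs) for bucket, errs in squared_errors.items()}

# Made-up records for illustration only.
records = [
    (1, 0.50, 0), (1, 0.40, 1),
    (3, 0.20, 0), (4, 0.70, 1),
    (8, 0.90, 1), (9, 0.10, 0),
]
scores = brier_by_position(records)
for bucket, score in scores.items():
    print(f"{bucket}: {score:.3f}")
```

A lower score is better. A noticeably higher score in the first bucket than in later ones would be consistent with the cold-start weakness described above.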

Multivariate time-series forecasting for media mix and demand modeling sometimes uses LSTM architectures. Some media mix modeling and demand forecasting vendors use LSTM-based models that take multiple time-series inputs, such as weekly spend across channels, price, distribution, and competitive activity, and produce a sales forecast. These models can capture complex nonlinear interactions and lagged effects that traditional regression models miss. The tradeoff is interpretability: LSTM-based forecast models do not produce coefficient estimates that can be directly interpreted as channel contributions, requiring additional model explanation methods to extract actionable insights from their predictions.
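To make the input format concrete, the sketch below shows one common way to slice weekly multivariate data into fixed-length windows that a sequence model can consume. The function name `make_windows`, the eight-week lookback, and the random data are assumptions for illustration. Real vendors may prepare their inputs differently.

```python
import numpy as np

def make_windows(features, target, lookback):
    """Slice weekly rows into model inputs of shape (samples, lookback, n_features).

    Each sample uses `lookback` consecutive weeks of features to predict
    the target in the week immediately after the window.
    """
    X, y = [], []
    for end in range(lookback, len(features)):
        X.append(features[end - lookback:end])
        y.append(target[end])
    return np.array(X), np.array(y)

# Illustrative random data: 20 weeks, 4 input series (e.g. spend, price, ...).
weeks, n_features, lookback = 20, 4, 8
rng = np.random.default_rng(0)
features = rng.random((weeks, n_features))
sales = rng.random(weeks)

X, y = make_windows(features, sales, lookback)
print(X.shape, y.shape)
```

Each window holds every input series side by side, which is what lets the model pick up lagged and cross-channel effects. It is also why no single coefficient per channel falls out of the fitted model.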

Sequential user behavior modeling for personalization may use either LSTM or attention-based architectures. Recommendation systems that model user behavior as a sequence, predicting which content or product a user will engage with next based on their recent history, use LSTM or transformer-based architectures depending on the system’s age and the organization’s infrastructure. Both kinds of system produce the same type of output, a ranked set of next-item predictions, but transformer-based systems typically handle longer history windows more effectively. Agencies building or evaluating personalization systems should ask vendors which sequence model architecture they use and what maximum sequence length the model can use effectively.

In practice

What LSTM looks like inside a working ad agency.

An agency inherits a demand forecasting system from a client that was built three years earlier by a previous vendor. The system uses a stacked two-layer LSTM trained on weekly sales, promotional calendar, media spend, and competitor pricing data to produce 12-week demand forecasts for each of the client’s 200 SKUs. The agency’s first task is to evaluate the system’s current performance before deciding whether to maintain, retrain, or replace it.

The team extracts the model’s forecasts and compares them against actuals from the past 52 weeks using mean absolute percentage error (MAPE), segmented by SKU category and by season. Results show that the LSTM performs well, achieving under 8% MAPE, on stable core SKUs with consistent seasonal patterns, but significantly underperforms on promotional SKUs and new product launches, where behavioral patterns change rapidly. The team traces this to the LSTM’s reliance on historical sequence context: it does not generalize well to distribution shifts caused by promotions or to new products with limited history.

The agency recommends a hybrid approach: retain the LSTM for the stable core SKU set where it performs well, and replace the promotional and new-product forecasting modules with a gradient-boosted model that uses promotional features as direct inputs rather than relying on historical sequence patterns. This targeted upgrade improves overall MAPE from 11.4% to 7.8% at lower re-implementation cost than a full architecture replacement.
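The segmented evaluation in this scenario can be sketched in a few lines. The row layout, the segment labels, and the numbers below are invented for illustration and are not the client's data. Skipping zero actuals is one common convention for MAPE, assumed here.

```python
from collections import defaultdict

def mape_by_segment(rows):
    """Return mean absolute percentage error (in percent) per segment.

    Each row is (segment, actual, forecast). Rows with an actual of zero
    are skipped because percentage error is undefined for them.
    """
    errors = defaultdict(list)
    for segment, actual, forecast in rows:
        if actual == 0:
            continue
        errors[segment].append(abs(actual - forecast) / abs(actual))
    return {segment: 100.0 * sum(errs) / len(errs) for segment, errs in errors.items()}

# Made-up weekly rows for illustration only.
rows = [
    ("core", 100, 95), ("core", 200, 210),
    ("promo", 100, 70), ("promo", 50, 65),
]
result = mape_by_segment(rows)
for segment, mape in result.items():
    print(f"{segment}: {mape:.1f}% MAPE")
```

Reporting the metric per segment rather than as one overall figure is what exposes the gap between stable and promotional SKUs.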
