A machine learning model designed to process and generate sequences of data, capturing dependencies between elements that occur at different positions or time steps. Sequence models underlie language modeling, time series forecasting, recommendation from behavioral histories, and any AI task where the order of inputs matters for the prediction or generation output.
Also known as sequential model, time series model, recurrent model
A sequence model processes ordered data where the position and context of each element relative to other elements in the sequence is informative for the task. Text is a sequence of tokens where word order determines meaning. Time series data is a sequence of measurements where temporal ordering captures dynamics. User behavioral logs are sequences of events where the order of actions captures session intent. Sequence models are architectures designed to exploit this ordering information, distinguishing them from models that treat each input independently without considering position or context.
Recurrent neural networks (RNNs) and their gated variants, Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs), process sequences by maintaining a hidden state that is updated at each time step, accumulating information from all prior elements. In principle this recurrence lets the network carry information across long spans, but in practice early RNNs failed to capture long-range dependencies because of the vanishing gradient problem. LSTMs address this problem by using gating mechanisms to selectively retain or forget information across many time steps. Because each step depends on the previous hidden state, the computation cannot be parallelized across the time dimension during training, which makes recurrent models slow to train on long sequences.
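The hidden-state update described above can be illustrated with a minimal sketch. It assumes a scalar, Elman-style recurrence with fixed, hand-picked weights (the names rnn_step and run_rnn and the weight values are hypothetical, chosen only for illustration); real RNNs use learned weight matrices, and LSTMs and GRUs add gates on top of this basic loop.

```python
import math


def rnn_step(h_prev, x_t, w_h, w_x, b):
    """One recurrent step: h_t = tanh(w_h * h_prev + w_x * x_t + b)."""
    return math.tanh(w_h * h_prev + w_x * x_t + b)


def run_rnn(xs, w_h=0.5, w_x=1.0, b=0.0, h0=0.0):
    """Process a sequence left to right, returning the hidden state after each step.

    The weights are illustrative constants, not trained values.
    """
    h = h0
    states = []
    for x_t in xs:
        # Each step needs the previous hidden state, so this loop
        # cannot be parallelized across the time dimension.
        h = rnn_step(h, x_t, w_h, w_x, b)
        states.append(h)
    return states


# The same elements in a different order yield a different final state.
forward = run_rnn([1.0, 0.0])
reverse = run_rnn([0.0, 1.0])
```

Because the final state differs between the two orderings, the sketch shows in miniature why a recurrent model is sensitive to input order.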
Transformer models, which use self-attention rather than recurrence, have largely replaced RNNs for most sequence modeling tasks. Their computation parallelizes across sequence positions during training, which enables much faster training on modern hardware, and their self-attention mechanism explicitly represents relationships between any two positions regardless of distance. Transformers are now the dominant architecture for language modeling and machine translation and are widely used for time series forecasting. Temporal fusion transformers and other time series-specific transformer variants adapt the attention mechanism to the specific structure of time series data, incorporating calendar features, static covariates, and uncertainty quantification into the sequence modeling framework.
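As a rough illustration of how self-attention relates any two positions directly, the following sketch implements scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the input vectors are used directly as queries, keys, and values, with no learned projections, no multiple heads, and no positional encoding, all of which a real transformer would include. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights); weights[i][j] is how strongly
    position i attends to position j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this query against every key, whatever its distance.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(w_j * v[c] for w_j, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Positions 0 and 2 hold the same vector; position 1 differs.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(x, x, x)
```

In this toy input, position 0 attends to position 2 as strongly as to itself even though another element sits between them, and each query row is computed independently of the others, which is what allows the work to be parallelized.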
An ad agency that uses language AI for content generation, deploys sales forecasting models for client planning, or builds behavioral recommendation systems for client e-commerce properties is working with sequence models at the core of each application. Understanding that these systems process data as ordered sequences, where order matters and earlier context is incorporated into later predictions, helps agencies explain model behavior, diagnose failure modes, and set appropriate expectations for what these systems can and cannot do with specific types of sequential input data.
Sales and demand forecasting using sequence models requires sufficient historical time series data to learn seasonal and trend patterns, with a minimum of 2 to 3 full seasonal cycles recommended. A weekly sales forecast model trained on 8 months of data will not have seen a full annual seasonal cycle and will fail to capture holiday-period demand spikes and summer slowdowns that are not represented in its training history. Deploying such a model for annual planning without communicating this limitation produces forecasts that clients use for inventory and budget planning decisions, compounding the model’s seasonal blind spot into downstream business decisions. Sequence models for seasonal forecasting should be trained on at least 2 years of history, with 3 or more years strongly preferred for categories with strong annual seasonality.
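The history-length guidance above can be turned into a simple pre-training check. The sketch below is illustrative only: the function name, the status labels, and the default of 52 observations per annual cycle (weekly data) are assumptions, while the thresholds of 2 and 3 cycles follow the recommendation in the text.

```python
def history_status(n_observations, season_length=52,
                   minimum_cycles=2, preferred_cycles=3):
    """Classify a training history by how many full seasonal cycles it covers.

    season_length is the number of observations per seasonal cycle
    (52 is assumed here for weekly data with annual seasonality).
    """
    cycles = n_observations / season_length
    if cycles < 1:
        return "insufficient"   # no full seasonal cycle observed
    if cycles < minimum_cycles:
        return "below_minimum"  # at least one cycle, but fewer than 2
    if cycles < preferred_cycles:
        return "minimum"        # 2 cycles covered, fewer than 3
    return "preferred"          # 3 or more cycles covered


# Roughly 8 months of weekly data (about 35 weeks) covers no full annual cycle.
eight_months = history_status(35)
two_years = history_status(104)
three_years = history_status(156)
```

A check like this is most useful when its result is surfaced alongside the forecast, so that a client can see when a model has not yet observed a full seasonal cycle.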
Session sequence models that predict next actions from behavioral history require careful handling of session boundaries to prevent the model from learning spurious cross-session patterns. A recommendation model trained on continuous behavioral logs without session segmentation will learn that actions taken days or weeks apart co-occur in training examples, generating recommendations based on cross-session correlations that mix genuine long-term preference signals with coincidental proximity in the log file. Correctly segmenting behavioral logs into sessions (groups of actions within a single continuous engagement period separated by gaps longer than a session timeout threshold) and modeling within-session versus cross-session context separately produces sequence models that capture the right temporal dependencies: recency-driven within-session intent versus longer-term preference patterns.
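A minimal sketch of the timeout-based session segmentation described above follows. It assumes events for a single user arrive as (timestamp, action) pairs and uses a 30-minute timeout, which is a common convention rather than a value given here; the function name split_sessions and the sample actions are hypothetical.

```python
from datetime import datetime, timedelta


def split_sessions(events, timeout=timedelta(minutes=30)):
    """Group one user's (timestamp, action) events into sessions.

    A new session starts whenever the gap since the previous event
    is longer than `timeout` (an assumed 30 minutes by default).
    """
    sessions = []
    for ts, action in sorted(events, key=lambda e: e[0]):
        if sessions and ts - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((ts, action))    # continue current session
        else:
            sessions.append([(ts, action)])      # gap too long: new session
    return sessions


# Illustrative log: two actions close together, then two isolated ones.
log = [
    (datetime(2024, 3, 1, 10, 0), "view_product"),
    (datetime(2024, 3, 1, 10, 10), "add_to_cart"),
    (datetime(2024, 3, 1, 12, 0), "view_product"),
    (datetime(2024, 3, 2, 9, 0), "search"),
]
sessions = split_sessions(log)
session_lengths = [len(s) for s in sessions]
```

With the log segmented this way, within-session sequences can feed a short-term intent model, while the ordered list of sessions can feed a separate longer-term preference model.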
Transformer-based language models do not fail outright when the input sequence is truncated, but output quality degrades, and the degradation pattern depends on where the truncation occurs. When a long brief or document exceeds the context window limit, the input is truncated at the context boundary. If the context window is filled from the beginning of the document, the model loses everything after the cutoff point. If filled from the end, it loses the beginning. For prompts where the most important context (task instructions, the specific brief) appears at the beginning, right-truncation (dropping the end of the input) is less damaging than left-truncation (dropping the beginning). Agencies building prompting pipelines that might exceed context limits should implement explicit truncation strategies that prioritize retaining the most critical context rather than relying on default truncation behavior that may silently discard the most important input content.
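One possible explicit truncation strategy is sketched below: the task instructions are always kept whole, and only the supporting document is trimmed to fit the remaining budget. This is an illustration under simplifying assumptions. Tokens are approximated by whitespace-separated words, whereas a real pipeline would count tokens with the model's own tokenizer, and the function name fit_to_context is hypothetical.

```python
def fit_to_context(instructions, document, max_tokens, keep="head"):
    """Build a prompt that fits max_tokens, never truncating the instructions.

    Tokens are approximated by whitespace splitting (an assumption).
    keep="head" drops the end of the document (right-truncation);
    keep="tail" drops the beginning (left-truncation).
    """
    if keep not in ("head", "tail"):
        raise ValueError("keep must be 'head' or 'tail'")
    instr_tokens = instructions.split()
    doc_tokens = document.split()
    if len(instr_tokens) > max_tokens:
        raise ValueError("instructions alone exceed the context limit")
    budget = max_tokens - len(instr_tokens)
    if len(doc_tokens) > budget:
        if keep == "head":
            doc_tokens = doc_tokens[:budget]
        else:
            doc_tokens = doc_tokens[len(doc_tokens) - budget:]
    return " ".join(instr_tokens + doc_tokens)


head_prompt = fit_to_context("Summarize the brief:", "a b c d e", max_tokens=6)
tail_prompt = fit_to_context("Summarize the brief:", "a b c d e", max_tokens=6,
                             keep="tail")
```

Raising an error when the instructions alone exceed the limit is a deliberate choice in this sketch: it makes the failure visible instead of silently discarding the content that matters most.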
An agency builds a promotional calendar forecasting tool for a specialty grocery client with 220 stores across 4 regions. The tool predicts weekly unit sales for 340 top-SKU products, incorporating planned promotions, holiday calendar effects, regional weather indicators, and trailing sales history. The agency evaluates two sequence model architectures: an LSTM sequence model and a temporal fusion transformer (TFT). The training dataset covers 156 weeks (3 years) of weekly sales per SKU per region, totaling 4.7 million training sequences. On the held-out 13-week validation set (the most recent 13 weeks), TFT achieves mean absolute percentage error (MAPE) of 8.3% across the 340 SKUs, versus 11.7% for the LSTM and 14.2% for a classical Holt-Winters statistical baseline. The TFT’s interpretable attention weights also reveal which historical time steps are most influential for each SKU’s forecast: for the carbonated beverage category, the model attends most to the same calendar week in the prior year and the 2-week window before major holidays, capturing annual seasonality and holiday lift patterns. For produce categories, the model attends primarily to the trailing 3-week window, reflecting the shorter demand cycles in that category. The attention pattern analysis is reported to the client as an interpretable explanation of how the forecast incorporates historical patterns, increasing client confidence in the model beyond what accuracy metrics alone would provide. The tool is deployed for the client’s quarterly promotional planning. There it replaces a manual spreadsheet process that consumed 14 hours per planning cycle, and it achieves forecast accuracy comparable to a dedicated forecasting analyst’s judgment-based estimates.
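For reference, the MAPE metric used to compare the models in this example can be computed as in the sketch below. The numbers are made up for illustration and are not the client's data, and the example does not state how MAPE was aggregated across SKUs or how zero-sales weeks were treated; skipping weeks with zero actual sales is an assumption made here to avoid division by zero.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent.

    Periods with zero actual sales are skipped (an assumption,
    since the percentage error is undefined there).
    """
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    if not pairs:
        raise ValueError("no periods with non-zero actuals")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


# Illustrative weekly unit sales for one SKU (not real data).
actual_units = [100, 200]
forecast_units = [110, 180]
error_pct = mape(actual_units, forecast_units)  # each week is off by 10%
```

Because MAPE weights every period equally in percentage terms, low-volume SKUs can dominate an average across products, which is worth keeping in mind when reading a single headline figure.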
The generative AI foundations module covers sequence models including RNNs, LSTMs, and transformer-based sequence modeling, with applications to time series forecasting, language generation, and behavioral recommendation systems deployed in agency client work.