The information an AI agent or language model retains and actively uses within a single task or conversation, analogous to human short-term memory: the facts, instructions, prior steps, and intermediate results held in the context window during execution. Working memory is the mechanism that allows AI agents to maintain coherent behavior across multi-step tasks, remember client instructions given at the start of a session, and accumulate results from tool calls into a growing context that informs subsequent decisions.
Also known as in-context memory, agent state, or active context.
In AI systems, working memory refers to the content of the context window: everything the model currently has available to attend to when generating its next response or action. This includes the initial prompt and instructions, the history of the conversation or task execution, the results returned by tools or external systems, and any documents or data provided as context. The model’s output at any step is conditioned on all this content simultaneously through the attention mechanism. The context window is both the capacity and the content of working memory: everything within it is active and accessible, while everything outside it is inaccessible.
AI agent working memory differs from long-term or external memory in that it is temporary: it exists only for the duration of the current task execution and is discarded when the session ends. Long-term memory in AI systems is stored externally, in databases, files, or retrieval systems, and is only accessible when explicitly retrieved into the context window. The distinction between working memory (in-context, active) and long-term memory (stored, must be retrieved) is the architectural basis for designing AI agent systems that can both reason within a task and build on knowledge accumulated across past tasks.
Working memory capacity is bounded by the context window length of the underlying model. When a task’s accumulated context exceeds the context window, earlier content must be truncated or summarized to make room for new content, which can cause the model to lose access to information it received early in the task. Managing working memory in long-running agent tasks requires strategies such as progressive summarization (compressing earlier content to free capacity), selective retrieval (pulling only the most relevant prior content into the current context), and state externalization (writing intermediate results to external storage and retrieving them when needed rather than retaining them in the context continuously).
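A minimal sketch of progressive summarization under a fixed token budget may make the trade-off concrete. Everything here is illustrative: `WorkingMemory`, `estimate_tokens`, and `truncate_summary` are hypothetical names, the word count is a crude stand-in for a real tokenizer, and the "summarizer" simply keeps the first few words where a real agent would call a model.

```python
from typing import Callable, List, Optional


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def truncate_summary(text: str, max_words: int = 8) -> str:
    """Placeholder summarizer (assumption): keep the first few words only."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " [...]"


class WorkingMemory:
    """Active context with a token budget; pinned instructions are never compressed."""

    def __init__(self, budget: int, pinned: Optional[List[str]] = None,
                 summarize: Callable[[str], str] = truncate_summary) -> None:
        self.budget = budget
        self.pinned: List[str] = list(pinned or [])
        self.entries: List[str] = []  # tool results, oldest first
        self.summarize = summarize

    def total_tokens(self) -> int:
        return sum(estimate_tokens(t) for t in self.pinned + self.entries)

    def add(self, text: str) -> None:
        self.entries.append(text)
        self._compress()

    def _compress(self) -> None:
        # Step 1: summarize entries from oldest to newest until under budget.
        for i in range(len(self.entries)):
            if self.total_tokens() <= self.budget:
                return
            self.entries[i] = self.summarize(self.entries[i])
        # Step 2: if summaries alone are not enough, drop the oldest entries.
        while self.entries and self.total_tokens() > self.budget:
            self.entries.pop(0)
```

Keeping instructions in a separate pinned list is one possible design choice: it means compression and truncation only ever touch intermediate results, so the early instructions cannot be the content that is lost.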
An ad agency deploying AI agents for multi-step tasks such as campaign brief generation, competitor analysis, content production pipelines, or account research workflows needs to design those agents with working memory constraints in mind. An agent asked to research 20 competitors, synthesize findings, and produce a strategy brief will accumulate more context than many current model context windows can hold without truncation. Understanding working memory limits prevents a common failure mode: an agent that appears to have understood complex instructions at the start of a task but produces output that ignores them, because accumulated tool results have pushed those instructions out of the active context window.
Placing critical instructions at both the beginning and the end of the working memory context reduces instruction loss in long agent tasks. Large language models tend to exhibit recency bias in their attention to context: content near the end of the context window often receives more weight than content far from the current generation point. For long agent tasks where early instructions may be pushed far from the active generation point by accumulated intermediate results, repeating key constraints and objectives just before the final synthesis step helps them receive attention weight closer to that of a fresh instruction. This structural prompt design pattern costs only a small number of tokens and can substantially reduce the rate at which agents ignore early instructions when producing final outputs after long multi-step execution.
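The repetition pattern can be sketched as a small prompt-assembly helper. The function name `build_synthesis_prompt` and the section labels are assumptions chosen for illustration, not a prescribed format.

```python
from typing import List


def build_synthesis_prompt(instructions: str,
                           intermediate_results: List[str],
                           task: str) -> str:
    """Place key instructions at the start and again just before the final task."""
    parts = [
        "INSTRUCTIONS:\n" + instructions,
        *intermediate_results,                        # accumulated tool output
        "REMINDER OF KEY INSTRUCTIONS:\n" + instructions,
        "TASK:\n" + task,                             # final synthesis request
    ]
    return "\n\n".join(parts)
```

The extra cost is one additional copy of the instructions, which is usually small compared with the intermediate results that separate the two copies.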
Structured note-taking during agent execution externalizes working memory to persistent storage, enabling longer tasks than the context window alone supports. An agent designed for a long-running competitive analysis task can be given a tool to write intermediate findings to an external scratchpad or document. Rather than accumulating all tool results in the context window, the agent periodically summarizes what it has learned into the external scratchpad and clears the processed content from active context. This pattern mimics how a human researcher takes notes during a long project rather than trying to hold all gathered information simultaneously in short-term memory. The external scratchpad persists across context resets and is retrieved selectively as needed, effectively extending working memory beyond the context window limit.
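One way the scratchpad pattern might look in code is shown below. This is a sketch under stated assumptions: `Scratchpad` and `research_step` are hypothetical names, a local JSON file stands in for whatever external storage is actually used, and the `summarize` callable stands in for a model call.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


class Scratchpad:
    """Hypothetical external note store backed by a JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path
        if not self.path.exists():
            self.path.write_text("{}", encoding="utf-8")

    def _load(self) -> Dict[str, str]:
        return json.loads(self.path.read_text(encoding="utf-8"))

    def write_note(self, key: str, summary: str) -> None:
        notes = self._load()
        notes[key] = summary
        self.path.write_text(json.dumps(notes, indent=2), encoding="utf-8")

    def read_notes(self, keys: List[str]) -> Dict[str, str]:
        """Selective retrieval: return only the requested notes that exist."""
        notes = self._load()
        return {k: notes[k] for k in keys if k in notes}


def research_step(context: List[str], pad: Scratchpad, key: str,
                  raw_results: List[str],
                  summarize: Callable[[List[str]], str]) -> None:
    """Add raw results to context, save a summary externally, then clear them."""
    start = len(context)
    context.extend(raw_results)                 # raw tool output enters context
    summary = summarize(context[start:])        # a model call in a real agent
    pad.write_note(key, summary)                # persist outside the context
    del context[start:]                         # free the active context
```

Because the notes live in a file, a new `Scratchpad` pointed at the same path after a context reset sees everything written earlier.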
Evaluating agent task completion on a representative sample of long tasks identifies working memory degradation before it affects production workflows. Working memory failures are systematic: they occur predictably when context length crosses a threshold relative to the model’s effective context window. Testing an agent on a representative long task, recording the context length at each step, and checking whether the output reflects instructions given early in the task versus only instructions that appeared late in the context provides empirical data on where working memory degradation begins. For the tasks that fail, identifying the context length at failure informs the design of summarization and state externalization checkpoints that keep active context within the reliable working range.
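The evaluation loop described above could be reduced to a few helpers like these. The names (`RunRecord`, `reflects_instruction`, `degradation_threshold`, `checkpoint_budget`) are hypothetical, the phrase check is a deliberately naive proxy for human or model-based grading, and taking the smallest failing context length as the threshold is a simplifying assumption.

```python
from typing import List, NamedTuple, Optional


class RunRecord(NamedTuple):
    context_tokens: int               # context length at the final step
    followed_early_instruction: bool  # did the output reflect the early instruction?


def reflects_instruction(output: str, required_phrases: List[str]) -> bool:
    """Naive check: every required phrase appears in the output (case-insensitive)."""
    lowered = output.lower()
    return all(p.lower() in lowered for p in required_phrases)


def degradation_threshold(runs: List[RunRecord]) -> Optional[int]:
    """Smallest context length at which a run ignored the early instruction."""
    failures = [r.context_tokens for r in runs if not r.followed_early_instruction]
    return min(failures) if failures else None


def checkpoint_budget(threshold: Optional[int],
                      margin_percent: int = 80) -> Optional[int]:
    """Context size at which to trigger summarization, below the failure point."""
    if threshold is None:
        return None
    return threshold * margin_percent // 100
```

The 80 percent margin is an arbitrary illustrative default. In practice the margin would be tuned against more test runs, since a single failing run gives only a rough estimate of where degradation begins.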
An agency builds an AI agent for automated competitive briefing: given a client, the agent searches for the top 8 competitors, retrieves recent news and positioning for each, summarizes each competitor’s strategy, and synthesizes a 2-page briefing with positioning recommendations. Initial testing on a 128k context window model shows the agent successfully completing briefings when the competitors have limited news coverage (total context at synthesis: 42,000 tokens) but, for data-rich competitors (total context at synthesis: 103,000 tokens), producing briefings that ignore the specific competitive framing instructions given at the start of the task. The failure pattern is consistent: the agent’s synthesis section does not reference the client-specific positioning angle (an instruction given around token 400) and focuses instead on generic competitive observations. The agency diagnoses this as a working memory issue: at 103,000 tokens of context, the opening instruction is far from the synthesis point and receives insufficient attention weight.
Three design changes are implemented. First, the competitive framing instructions are injected immediately before the synthesis prompt, not only at the task start. Second, for each competitor research phase, the agent writes a 150-word structured summary to an external scratchpad, and the raw tool results are cleared from the active context after summarization. Third, at synthesis time, the agent retrieves only the 8 compressed competitor summaries (approximately 1,200 words in total, or roughly 1,600 tokens at a typical English ratio of about 0.75 words per token) rather than the full accumulated research context.
After these changes, total context at synthesis is consistently below 20,000 tokens, and the briefings correctly apply the client-specific competitive framing across all test cases, including those with extensive competitor coverage. Agency strategists rate the agent’s output as “meets the brief” in 88% of test cases, versus 54% before the working memory management redesign.
The generative AI foundations module covers working memory in AI agents including context window management, state externalization strategies, progressive summarization, and the agent architecture patterns that maintain reliable task execution beyond single-context-window length.