A neural network design that pairs an encoder, which compresses an input into a compact internal representation, with a decoder, which reconstructs a desired output from that representation. Encoder-decoder architectures underlie translation, summarization, image generation, and many of the generative AI tools agencies use today.
Also known as seq2seq model, encode-decode model, sequence-to-sequence architecture
An encoder-decoder system processes input in two stages. The encoder reads the input, whether a sentence, an image, or an audio clip, and produces a compressed representation that captures the essential information in a form the decoder can work with. The decoder then takes that representation and generates the desired output: a translation, a summary, an image, or a piece of audio. The two components are trained together so the encoder learns to produce representations that the decoder can use effectively.
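The two-stage flow can be pictured with a deliberately tiny sketch. It is not a trained model: the vocabulary, the word-count "representation", and the uppercase "output" are all hypothetical stand-ins, chosen only to show that the decoder works from the representation and never from the raw input.

```python
from typing import List

# Hypothetical toy vocabulary; a real model learns its representation from data.
VOCAB = ["the", "cat", "sat", "mat", "on"]


def encode(tokens: List[str]) -> List[int]:
    """Compress a token sequence into a fixed-size vector of word counts."""
    return [tokens.count(word) for word in VOCAB]


def decode(representation: List[int]) -> List[str]:
    """Build an output sequence from the representation alone."""
    output: List[str] = []
    for word, count in zip(VOCAB, representation):
        output.extend([word.upper()] * count)
    return output


sentence = "the cat sat on the mat".split()
representation = encode(sentence)   # fixed size, whatever the input length
result = decode(representation)     # the decoder never sees `sentence`
```

Word order does not survive this compression, which is a small-scale picture of the general point: whatever the encoder fails to keep, the decoder cannot recover.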
The architecture was originally developed for machine translation, where the encoder reads a sentence in one language and the decoder produces its equivalent in another. It generalized rapidly to any task that requires transforming one sequence or structure into another: document summarization, image captioning, speech synthesis, and image generation. Many modern generative AI systems, including some image generators and text-to-speech tools, build on encoder-decoder foundations, often with attention mechanisms that allow the decoder to focus on the most relevant parts of the encoded representation at each generation step.
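The attention step mentioned above can be sketched in a few lines. This is a simplified dot-product version under stated assumptions: the vectors are made up, and the learned projections and scaling that production models use are left out. The function name `attend` is illustrative, not taken from any library.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: List[float],
           encoder_states: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Weight each encoder state by its similarity to the decoder's query."""
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the weighted average of the encoder states.
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Hypothetical encoder outputs for a three-token input, and one decoder query.
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend([0.0, 3.0], states)
```

Here the second encoder state receives the largest weight because it aligns most closely with the query, so it contributes most to what the decoder sees at this step.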
Large language models like GPT use a decoder-only architecture, meaning they have no separate encoder and generate output by predicting the next token based on prior context. Models like BERT use an encoder-only architecture for tasks that require understanding rather than generation, such as classification and search. Encoder-decoder models like T5 and BART are used for tasks that require transforming one text into another.
Many of the generative AI tools a working ad agency uses, from translation platforms to image generators to summarization tools, are built on encoder-decoder principles or on architectures derived from them. Understanding how these systems work at a conceptual level helps agencies evaluate their capabilities accurately, identify why they fail in specific situations, and make informed decisions about which tools are appropriate for which tasks.
The architecture explains why translation quality varies by language pair. A translation model trained on abundant English-French data will have a well-trained encoder and decoder for that pair and produce high-quality output. The same model trained on sparse data for a less common language pair will learn weaker representations for that pair and produce lower-quality output. Agencies running multilingual campaigns need to test translation tools on the specific language pairs they will use, not rely on aggregate quality claims.
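A per-pair check of this kind might look like the sketch below. The scores are invented for illustration, the language-pair codes and the 0.75 threshold are assumptions, and the helper names are hypothetical; an agency would substitute its own quality ratings and cut-off.

```python
from statistics import mean
from typing import Dict, List


def aggregate_score(scores_by_pair: Dict[str, List[float]]) -> float:
    """Average every score together, the way a headline quality claim might."""
    return mean(s for scores in scores_by_pair.values() for s in scores)


def weak_pairs(scores_by_pair: Dict[str, List[float]], threshold: float) -> List[str]:
    """Return the language pairs whose own average falls below the threshold."""
    return sorted(pair for pair, scores in scores_by_pair.items()
                  if mean(scores) < threshold)


# Hypothetical reviewer ratings on a 0-1 scale, not real benchmark results.
ratings = {
    "en-fr": [0.90, 0.92, 0.88],
    "en-es": [0.90, 0.90, 0.90],
    "en-is": [0.60, 0.55, 0.65],
}
overall = aggregate_score(ratings)
flagged = weak_pairs(ratings, threshold=0.75)
```

With these illustrative numbers the overall average clears the threshold while one pair falls well below it, which is exactly the gap an aggregate claim can hide.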
The architecture also explains degradation on long inputs. Encoders compress inputs into fixed-size or attention-limited representations. Very long inputs, such as a 30-page research document, stress this compression and can cause the decoder to lose access to information from early in the document when generating output at the end. Large language models face a related limit: a document that exceeds or strains the context window tends to yield a lower-quality summary.
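One common workaround is to check a document against a size budget and split it before sending it to a tool. The sketch below assumes word count as a rough stand-in for the tokens a real model counts, and both function names are hypothetical; actual limits and tokenization depend on the tool.

```python
from typing import List


def fits_budget(text: str, max_words: int) -> bool:
    """Rough check: does the text stay within the word budget?"""
    return len(text.split()) <= max_words


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive pieces of at most max_words words each."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


# A ten-word placeholder document and an artificially small budget.
document = " ".join(f"word{i}" for i in range(10))
chunks = split_into_chunks(document, max_words=4)
```

Splitting keeps each piece within the budget, but it also cuts connections between pieces, so summaries built this way still need a human read for continuity.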
Tool selection maps to architecture. When an agency chooses between a generative tool for content creation versus a classification tool for content analysis, it is implicitly choosing between decoder-heavy and encoder-heavy architectures. Understanding this mapping helps agencies match tool architecture to task requirements rather than evaluating all AI tools on the same benchmarks regardless of what they are designed to do.
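The mapping can be written down as a simple lookup. The table below follows the model families named in this entry and is an illustrative simplification: the task labels and the function name are assumptions, and in practice large decoder-only models also handle many transformation tasks.

```python
# Illustrative task-to-architecture table based on the families described above.
ARCHITECTURE_FOR_TASK = {
    "translation": "encoder-decoder",
    "summarization": "encoder-decoder",
    "classification": "encoder-only",
    "search": "encoder-only",
    "open-ended generation": "decoder-only",
}


def suggest_architecture(task: str) -> str:
    """Return the architecture family this entry associates with a task."""
    return ARCHITECTURE_FOR_TASK.get(task.strip().lower(), "unknown")


choice = suggest_architecture("Classification")
```

A lookup like this is a starting point for asking vendors the right questions, not a substitute for testing a tool on the agency's own content.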
An agency is evaluating two AI tools for automating the translation of a consumer brand’s product catalog into eight languages. Both tools produce acceptable English-to-Spanish output on the vendor benchmark tests. When tested on the actual catalog content, which includes brand-specific terminology and product names that appear infrequently in public training corpora, Tool A maintains quality across all eight languages while Tool B produces high-quality output for the four most common language pairs and degrades significantly for the four less common ones. The agency recognizes this as an encoder training data coverage issue: Tool B’s encoder has weaker representations for the underrepresented language pairs because it was trained on less data for those pairs. Tool A is selected, and the specification includes a clause requiring quality benchmarking on the full language set before the contract is finalized.
The generative AI foundations module of the workshop covers how today’s models work, including the architectural differences that determine when a tool is right for the task and when it will fail.