A natural language processing task in which an AI system answers questions posed in natural language by extracting relevant spans from a provided document, generating answers from knowledge stored in a language model’s parameters, or combining retrieval and generation. Question answering enables AI systems to serve as knowledge interfaces, allowing users to query information through natural language rather than structured search.
Also known as QA. Related terms: extractive QA, open-domain QA
Question answering systems come in several varieties depending on how they source their answers. Extractive QA identifies and returns the specific span of text from a provided document that best answers the question, without generating new text. Reading comprehension models such as early BERT-based systems are extractive: they are trained to predict the start and end tokens of the answer span in a context document. Generative (closed-book) QA produces a natural language answer from the model’s parameters without requiring an external document, drawing on patterns learned during pre-training. Open-domain QA answers questions without a provided document, typically by combining retrieval with reading: a retriever finds relevant documents in a large corpus, and a reader extracts or generates an answer conditioned on those documents.
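The span selection step in extractive QA can be sketched as follows. This is a minimal illustration, not a trained model: the per-token start and end scores are made-up numbers standing in for model output, and `best_span` and `max_len` are hypothetical names chosen for this example.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) token indices that maximize
    start_scores[start] + end_scores[end], with start <= end and a
    span of at most max_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy context, split on whitespace instead of a real subword tokenizer.
tokens = "The campaign launched in March 2021 across connected TV".split()

# Hypothetical scores; a real reader model would produce these per token.
start_scores = [0.1, 0.0, 0.2, 0.3, 2.5, 0.4, 0.0, 0.1, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.2, 0.6, 2.8, 0.1, 0.0, 0.2]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)  # March 2021
```

The constraint that the end index cannot precede the start index is what keeps the returned answer a contiguous span of the source text.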
Retrieval-augmented generation (RAG) is the current practical standard for knowledge-grounded question answering in production systems. RAG retrieves relevant context documents for each question and passes them to a language model, which generates an answer grounded in, and ideally citing, the retrieved context. RAG systems tend to produce more factually accurate and verifiable answers than pure generative QA because the answer is tied to retrieved documents that can be inspected, rather than relying entirely on potentially outdated or hallucinated knowledge from the model’s parameters. For enterprise applications requiring answers grounded in specific organizational knowledge, RAG provides a path to deploying question answering that is both capable and auditable.
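The retrieve-then-generate flow described above might look roughly like this. It is a sketch under simplifying assumptions: retrieval is plain word overlap rather than a dense embedding model, the corpus and its contents are invented for illustration, and `stub_generate` is a placeholder where a real language model call would go.

```python
import re
from typing import Callable, Dict, List, Tuple


def words(text: str) -> set:
    """Lowercased alphanumeric word set; a stand-in for real embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, corpus: Dict[str, str], k: int = 2) -> List[Tuple[str, str]]:
    """Return up to k (doc_id, text) pairs ranked by word overlap with the question."""
    q = words(question)
    scored = [(len(q & words(text)), doc_id, text) for doc_id, text in corpus.items()]
    scored = [item for item in scored if item[0] > 0]  # drop documents with no overlap
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, text) for _, doc_id, text in scored[:k]]


def build_prompt(question: str, passages: List[Tuple[str, str]]) -> str:
    """Place the retrieved passages, tagged with their ids, ahead of the question."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return ("Answer using only the passages below and cite passage ids.\n"
            f"{context}\nQuestion: {question}\nAnswer:")


def answer_question(question: str, corpus: Dict[str, str],
                    generate: Callable[[str], str], k: int = 2) -> Tuple[str, List[str]]:
    """Return (answer_text, source_ids); decline when nothing is retrieved."""
    passages = retrieve(question, corpus, k)
    if not passages:
        return "No supporting document found.", []
    return generate(build_prompt(question, passages)), [doc_id for doc_id, _ in passages]


# Invented two-document corpus, for illustration only.
corpus = {
    "policy-2024": "Returns are accepted within 30 days of delivery.",
    "shipping": "Standard shipping takes five business days.",
}


def stub_generate(prompt: str) -> str:
    # Placeholder: a real system would send the prompt to a language model here.
    return "(model output would appear here)"


text, sources = answer_question(
    "How many days do customers have for returns?", corpus, stub_generate, k=1)
print(sources)  # ['policy-2024']
```

Returning the source ids alongside the answer is what makes the result inspectable: a reviewer can open the cited passage and check the answer against it.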
Question answering evaluation uses metrics including exact match (the fraction of questions where the predicted answer matches the reference answer exactly, usually after normalizing case and punctuation), token-level F1 (the harmonic mean of precision and recall over the tokens shared by the predicted and reference answers), and human evaluation of factual correctness and completeness. For open-domain and generative QA, exact match is often too strict because correct answers can be phrased in many valid ways; token-level F1 is more lenient because it gives partial credit. Human evaluation remains the gold standard for assessing answer quality on nuanced questions where surface-level token overlap is a poor proxy for correctness.
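The two automatic metrics can be sketched in a few lines. The normalization below (lowercasing, stripping punctuation, dropping English articles) follows the convention commonly used for SQuAD-style evaluation; treating it as the right normalization for any particular project is an assumption, and the example strings are invented.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, remove punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


prediction = "within 30 days of delivery"
reference = "30 days"
print(exact_match(prediction, reference))         # 0.0
print(round(token_f1(prediction, reference), 3))  # 0.571
```

Here the prediction contains the full reference answer plus extra words, so exact match scores it as wrong while F1 gives partial credit (precision 2/5, recall 2/2).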
A working ad agency that has accumulated a large knowledge base of campaign case studies, research documents, brand guidelines, and competitive intelligence derives value from that knowledge only if it can be efficiently accessed when needed. Traditional document search requires knowing which document to look for; question answering allows users to ask the specific question they have and receive a direct answer regardless of which document contains the relevant information. Building question answering interfaces over agency knowledge bases transforms a static document archive into a dynamic knowledge service that junior team members can query as efficiently as senior team members who know where to look.
Campaign brief question answering enables faster brief intake and strategy development. An AI question answering system built over the agency’s accumulated brief and strategy corpus can answer questions that arise during brief intake, such as “what is the typical creative approach for a launch campaign in this category?” or “what messaging has performed best with this audience segment in past campaigns?” It can do so in seconds, rather than requiring a strategist to search manually through past work. This retrieval capability accelerates the brief intake process and surfaces relevant precedents that might otherwise not be considered, particularly in agencies whose archives are too extensive for any individual to keep in memory.
Customer service question answering deployed on client properties requires answer grounding and hallucination prevention. Deploying a question answering chatbot on a client’s e-commerce site or support portal introduces quality and liability risks if the QA system generates answers from model parameters rather than from verified product and policy documentation. A QA system that answers “what is your return policy?” by generating text from training data may produce an incorrect answer if the client’s policy was updated after the model’s training cutoff. Grounding the QA system in the client’s current product and policy documentation through RAG makes it far more likely that answers reflect the current, authoritative source of truth, and lets each answer be audited against the documents it was drawn from. Grounding reduces hallucination but does not eliminate it, so answers still need monitoring. It should be treated as a baseline architecture requirement for any QA system deployed in a customer-facing context.
QA system evaluation must test retrieval quality and generation accuracy separately. A RAG-based QA system can fail at two distinct points: retrieval failure, where the relevant document is not retrieved because it does not score as similar enough to the query, and generation failure, where the relevant document is retrieved but the model generates an incorrect or incomplete answer from it. These failure modes require different remediation: retrieval failures are addressed through improved embedding models, better chunking, or hybrid retrieval; generation failures are addressed through a stronger language model, better prompt engineering, or fine-tuning. Evaluating the two stages separately is essential for attributing failures correctly and applying the right fix.
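One way to attribute failures to the right stage is to label each test question by where it went wrong. The sketch below assumes each evaluation record already carries the retrieved chunk ids, a human-curated set of relevant chunk ids, and a correctness judgment for the answer; the `EvalRecord` structure and the sample records are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Set


@dataclass
class EvalRecord:
    question_id: str
    retrieved_ids: List[str]   # chunk ids returned by the retriever
    relevant_ids: Set[str]     # chunk ids judged relevant by reviewers
    answer_correct: bool       # reviewer judgment of the final answer


def classify(record: EvalRecord) -> str:
    """Label a record as a retrieval failure, a generation failure, or ok.

    Retrieval is checked first: if no relevant chunk was retrieved, the
    record counts as a retrieval failure even when the answer happens to
    be correct, since the model was not answering from the corpus.
    """
    if not set(record.retrieved_ids) & record.relevant_ids:
        return "retrieval_failure"
    if not record.answer_correct:
        return "generation_failure"
    return "ok"


def summarize(records: List[EvalRecord]) -> Counter:
    """Count records per outcome label."""
    return Counter(classify(r) for r in records)


# Invented records, for illustration only.
records = [
    EvalRecord("q1", ["c1", "c2"], {"c2"}, True),
    EvalRecord("q2", ["c3", "c4"], {"c9"}, False),
    EvalRecord("q3", ["c5", "c6"], {"c5"}, False),
]
summary = summarize(records)
print(dict(summary))
```

A summary dominated by retrieval failures points toward embedding, chunking, or hybrid retrieval work; one dominated by generation failures points toward the prompt or the model.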
An agency builds a question answering system for its media planning team that answers planning questions by retrieving relevant information from a corpus of 1,200 media research reports, platform specification sheets, and past media plan case studies. The system uses a RAG architecture: a dense retrieval model selects chunks from the corpus, and a language model synthesizes an answer from them. The agency evaluates the system on a test set of 80 planning questions with known correct answers, curated by senior media planners.
Initial evaluation shows retrieval precision at 5 (the fraction of the top 5 retrieved chunks that contain relevant information) of 0.61, meaning that on average about 3 of the 5 chunks retrieved per question are relevant. Answer accuracy, rated by the senior planners on a scale with a maximum of 3.0, averages 2.1 for questions where relevant chunks were retrieved and 0.8 for questions where retrieval failed to surface relevant chunks. The agency identifies the retrieval step as the primary bottleneck: raising retrieval precision from 0.61 to 0.80 should increase the fraction of questions for which the language model has relevant context to work with, and given the 2.1 versus 0.8 accuracy gap, that is expected to have a larger impact than improving the language model.
The agency focuses optimization effort on three retrieval improvements: adding media planning-specific synonym expansion (expanding “OTT” to include “streaming video” and “connected TV”), reindexing the corpus with smaller 256-token chunks to improve precision, and adding metadata filters that restrict retrieval to documents within a specified date range for time-sensitive questions. After these changes, retrieval precision improves to 0.79, and answer accuracy on the correctly retrieved subset improves from 2.1 to 2.6. The agency concludes that the focused retrieval engineering investment produced larger QA quality gains than upgrading to a more capable language model would likely have delivered at comparable cost.
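The retrieval metric and the three retrieval improvements in this example can be sketched as small helper functions. These are illustrative only: the function names are hypothetical, the synonym table contains just the one expansion mentioned in the text, whitespace splitting stands in for a real tokenizer, and the sample inputs are invented.

```python
from datetime import date
from typing import Dict, List, Sequence

# Only the expansion named in the example; a real table would be curated by planners.
SYNONYMS: Dict[str, List[str]] = {"ott": ["streaming video", "connected tv"]}


def precision_at_k(relevance: Sequence[bool], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks judged relevant.

    Divides by k even if fewer than k chunks were retrieved.
    """
    return sum(relevance[:k]) / k


def expand_query(query: str) -> str:
    """Append known synonyms for each whitespace-separated term in the query."""
    extra = [syn for word in query.lower().split() for syn in SYNONYMS.get(word, [])]
    return " ".join([query] + extra)


def chunk_tokens(tokens: List[str], size: int = 256) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def in_date_range(doc_date: date, start: date, end: date) -> bool:
    """Metadata filter: keep a document only if its date lies within [start, end]."""
    return start <= doc_date <= end


# Invented relevance judgments for one question's top 5 chunks.
print(precision_at_k([True, True, True, False, False]))  # 0.6
print(expand_query("OTT reach benchmarks"))
print([len(c) for c in chunk_tokens(["tok"] * 600)])      # [256, 256, 88]
print(in_date_range(date(2023, 6, 1), date(2023, 1, 1), date(2023, 12, 31)))  # True
```

Averaging `precision_at_k` over all test questions gives the kind of figure reported above; the other three helpers correspond to query expansion, smaller chunks, and date filtering respectively.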
The generative AI foundations module covers question answering architectures including extractive QA, retrieval-augmented generation, and RAG system evaluation, providing the technical foundation for building reliable knowledge interfaces on agency and client information assets.