Word Error Rate.

The standard metric for evaluating the accuracy of automatic speech recognition systems, measuring the minimum number of word substitutions, insertions, and deletions required to transform the system’s transcribed output into the correct reference transcript, expressed as a percentage of the total words in the reference. Word error rate determines whether AI transcription tools are reliable enough to use as the basis for post-call analysis, meeting intelligence, and voice-driven workflows in agency operations.

Also known as WER or transcription error rate; often reported inversely as ASR accuracy (100% minus WER).

What it is

A working definition of word error rate.

Word error rate is computed by aligning the recognized transcript to the reference transcript and counting the minimum number of word-level edits that separate them. A substitution occurs when the recognizer produces a different word than the reference (recognizing “marketing” as “marking”). An insertion occurs when the recognizer produces a word that is not in the reference (adding “the” where none exists). A deletion occurs when the recognizer fails to produce a word that is in the reference (dropping “not” from “should not”). WER equals the total number of substitutions, insertions, and deletions divided by the total number of words in the reference, expressed as a percentage. A WER of 10% means roughly one error for every ten words in the reference. Because insertions count as errors without adding to the reference length, WER can exceed 100% on very poor transcripts.
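The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's scoring tool: the names `WerResult` and `word_error_rate` are made up for this example, and the only text normalization assumed is lowercasing and splitting on whitespace (real evaluations usually also normalize punctuation and numerals).

```python
from dataclasses import dataclass


@dataclass
class WerResult:
    substitutions: int
    deletions: int
    insertions: int
    reference_words: int

    @property
    def wer(self) -> float:
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_words


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    """Word-level edit distance between a reference and a recognized transcript."""
    # Assumption: lowercase + whitespace split is the only normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # cost[i][j] = (total edits, substitutions, deletions, insertions)
    # needed to align ref[:i] with hyp[:j].
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)      # all reference words deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)      # all hypothesis words inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]   # words match, no edit
                continue
            t, s, d, n = cost[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            cost[i][j] = min(substitution, deletion, insertion, key=lambda c: c[0])

    _, subs, dels, ins = cost[len(ref)][len(hyp)]
    return WerResult(subs, dels, ins, len(ref))


result = word_error_rate(
    "we should not change the marketing plan",
    "we should change the marking plan",
)
print(result, f"WER = {result.wer:.1%}")
```

In this example the recognizer drops “not” (one deletion) and turns “marketing” into “marking” (one substitution), so two edits against seven reference words give a WER of about 28.6%.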

WER values vary substantially across speaking conditions. Broadcast news speech with a single speaker, clear audio, and standard vocabulary achieves WER below 5% on modern systems. Conversational speech with multiple speakers, informal vocabulary, overlapping talk, and background noise produces WER of 15 to 30% for standard systems. Domain-specific vocabulary, strong accents, technical terminology, and poor microphone conditions all increase WER substantially. Meeting and conference call transcription achieves WER of 10 to 20% in typical conditions, with lower error rates for enrolled speakers and higher rates for large groups with frequent interruptions.

WER measures the transcription system’s output quality but not the downstream impact on the tasks that use the transcript. A 15% WER transcript may be perfectly usable for extracting action items and sentiment trends but completely unreliable for exact quote attribution or compliance documentation. The acceptable WER threshold depends on what the transcript is used for: rough notes and topic extraction tolerate higher WER than verbatim records or systems that parse specific phrases for classification. Evaluating WER on a sample of representative recordings from the intended deployment context is the correct method for determining whether a transcription service meets the specific quality requirements of the planned use case.

Why ad agencies care

Why word error rate determines which post-call AI workflows are reliable and which require human correction in agency operations.

A working ad agency deploying AI transcription for client call notes, internal meeting records, and voice-driven workflow automation is implicitly accepting the WER of the transcription service it uses. That WER, measured on the agency’s actual call conditions rather than on benchmark datasets, determines whether downstream AI systems built on transcripts produce reliable outputs or accumulate errors. An action item extraction model trained on clean transcripts will produce lower-quality extractions on transcripts with 20% WER because key verbs and subjects may be misrecognized. Understanding WER and how to measure it on representative samples enables agencies to set realistic expectations for AI-transcription-dependent workflows and to identify when transcription quality is the binding constraint on downstream output quality.

WER on representative call samples predicts the correction effort required before AI-generated meeting notes are client-ready. An agency that processes client calls through an AI transcription service and uses those transcripts to generate structured meeting notes should measure WER on a sample of 20 to 30 representative calls before establishing the workflow. A WER of 8% on clear two-party calls suggests that generated meeting notes will require light editing (approximately 5 to 10 minutes of correction per 60-minute call). A WER of 22% on multi-party calls with variable audio quality suggests that generated notes will require substantial correction (20 to 35 minutes per call), potentially offsetting the time savings from AI-assisted drafting. This empirical calibration allows the agency to scope the workflow correctly and staff for the actual correction burden rather than assuming best-case transcription quality.
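One way to summarize such a sample is to pool errors and reference words per call type rather than averaging per-call percentages, so that long calls are not underweighted. The sketch below assumes the edit counts have already been obtained from a scoring step; the function name `pooled_wer_by_condition` and the sample figures are hypothetical, chosen only so the totals match the 8% and 22% cases discussed above.

```python
from collections import defaultdict


def pooled_wer_by_condition(calls):
    """calls: iterable of (condition, edit_count, reference_word_count)."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for condition, edit_count, reference_words in calls:
        errors[condition] += edit_count
        words[condition] += reference_words
    # Pooled WER: total edits divided by total reference words per condition.
    return {c: errors[c] / words[c] for c in words if words[c] > 0}


# Hypothetical scored calls, not real measurements.
sample = [
    ("two_party", 40, 500),
    ("two_party", 24, 300),
    ("multi_party", 110, 500),
]
for condition, wer in pooled_wer_by_condition(sample).items():
    print(f"{condition}: {wer:.1%}")
```

Reporting the result per condition, instead of one blended number, keeps the difference between clear two-party calls and noisy multi-party calls visible when scoping the correction effort.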

Domain-specific vocabulary and product names require custom vocabulary configuration to bring WER to acceptable levels for client brand contexts. Standard transcription services train on general speech corpora and will misrecognize domain-specific product names, technical terminology, and brand-specific language that is rare in general training data. An agency working with a pharmaceutical client whose products have names unlike common English words, or a technology client with coined brand terms, will see elevated WER specifically on the product and brand vocabulary that is most critical to accurate documentation. Most enterprise transcription services support custom vocabulary lists, phrase boosts, or fine-tuning on domain-specific audio that can reduce WER on critical terminology by 40 to 60%. Configuring these domain adaptations before deploying transcription workflows for clients with specialized vocabulary is the single highest-leverage quality improvement available without changing transcription providers.
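Overall WER can hide errors concentrated on a handful of brand terms, so it can help to track those terms separately before and after vocabulary configuration. The sketch below is a simple count-based check, not an alignment-based keyword error rate: the function name `term_recall` and the product names “Zelvira” and “Quorbit” are invented for illustration.

```python
import re


def term_recall(reference: str, hypothesis: str, terms: list[str]) -> dict[str, float]:
    """Share of each term's reference occurrences that also appear in the hypothesis."""
    ref = reference.lower()
    hyp = hypothesis.lower()
    recall = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
        expected = len(pattern.findall(ref))
        if expected == 0:
            continue  # term never spoken in this call, nothing to score
        found = len(pattern.findall(hyp))
        # Cap at 1.0 so spurious extra occurrences do not inflate the score.
        recall[term] = min(found, expected) / expected
    return recall


# Hypothetical transcripts with a coined product name.
reference = "Zelvira launch moves to May and Zelvira pricing stays"
hypothesis = "Zel vera launch moves to May and Zelvira pricing stays"
print(term_recall(reference, hypothesis, ["Zelvira", "Quorbit"]))
```

Here the recognizer gets one of the two mentions of the product name right, so the term scores 0.5 even though the transcript as a whole looks mostly accurate.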

Multi-speaker diarization error compounds with WER to determine the accuracy of speaker-attributed transcripts used for compliance and CRM documentation. Post-call analysis workflows that attribute statements to specific speakers depend on both transcription accuracy (WER) and diarization accuracy (who said which segment). If the two kinds of error are roughly independent and both are counted per word, a transcript with 10% WER and an 8% diarization error rate will have a transcription or attribution error in about 17 to 18% of attributed words, which may be acceptable for general note-taking but is inadequate for compliance documentation requiring verbatim attribution. Evaluating these two error sources together, on the specific call types and conditions used in the intended workflow, produces a realistic assessment of transcript quality for the specific documentation use case.
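The combined figure can be checked with a short calculation. This is a simplification: it treats both rates as per-word probabilities and assumes the two kinds of error occur independently, whereas diarization error is often measured over speech time rather than words. Under those assumptions the combined rate is about 17%, and the simple sum of 18% is an upper bound.

```latex
\begin{aligned}
P(\text{word is misrecognized or misattributed})
  &= 1 - (1 - 0.10)(1 - 0.08) \\
  &= 1 - 0.828 \\
  &= 0.172 \;\le\; 0.10 + 0.08 = 0.18
\end{aligned}
```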

In practice

What word error rate looks like inside a working ad agency.

An agency is selecting an AI transcription platform for its 18-person account and strategy team that conducts approximately 200 client calls per month. The team uses meeting transcripts for three purposes: generating structured post-call notes (high WER tolerance, topic-level accuracy sufficient), identifying client commitments for CRM logging (medium WER tolerance, action item verbs must be accurate), and extracting verbatim client quotes for inclusion in internal briefings (low WER tolerance, exact wording required).

The agency evaluates three transcription services using a test set of 25 representative calls: a mix of 2-person client-agency calls on video conferencing, 4 to 6 person client workshop calls, and 2-person calls from mobile with variable audio quality. WER is measured against manually verified reference transcripts produced by a professional transcription service.

Service A achieves WER of 9.2% on 2-person video calls and 18.7% on multi-party workshop calls. Service B achieves WER of 7.4% on 2-person calls and 14.1% on multi-party calls, but lacks custom vocabulary support for the agency’s technology and marketing clients. Service C achieves WER of 11.3% on 2-person calls and 16.4% on multi-party calls with standard vocabulary, but drops to 6.8% and 11.9% respectively after custom vocabulary configuration for the agency’s 3 largest client domains.

The agency selects Service C based on the post-configuration WER performance, which best meets the verbatim quote extraction requirement for high-value client calls. Custom vocabulary lists are configured for all 8 clients with product-specific terminology within the first month. Post-deployment tracking shows quote extraction accuracy (manually verified on a 40-call sample) of 87%, meeting the team’s requirement for verbatim quote use in briefings, and average post-call note generation time decreasing from 28 minutes to 11 minutes per call.
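The comparison in this scenario can be written down as a small check. The WER figures below are the ones reported in the example; the selection rule (prefer an option that is at least as good as every other option on both call types) and the function name `dominant_options` are assumptions made for illustration, not the agency's stated procedure.

```python
# WER in percent: (2-person calls, multi-party calls), taken from the scenario.
results = {
    "Service A": (9.2, 18.7),
    "Service B": (7.4, 14.1),
    "Service C (standard vocabulary)": (11.3, 16.4),
    "Service C (custom vocabulary)": (6.8, 11.9),
}


def dominant_options(results):
    """Options whose WER is no worse than every other option on both call types."""
    winners = []
    for name, (two_person, multi_party) in results.items():
        beats_all = all(
            two_person <= other_two and multi_party <= other_multi
            for other, (other_two, other_multi) in results.items()
            if other != name
        )
        if beats_all:
            winners.append(name)
    return winners


print(dominant_options(results))
```

With these figures only the custom-vocabulary configuration of Service C comes out ahead on both call types. When no single option wins on every condition the list is empty, and the choice has to be weighted by which call type matters most for the intended use.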

Build the speech AI evaluation expertise that determines whether transcription tools meet the accuracy requirements of voice-driven agency workflows through The Creative Cadence Workshop.

The generative AI foundations module covers word error rate including WER measurement methodology, the factors that increase WER in real-world call conditions, custom vocabulary configuration, and how to set WER thresholds calibrated to the specific downstream use of meeting transcripts in agency operations.