AI Glossary · Letter S

Speech Recognition.

A technology that converts spoken audio into text by processing acoustic signals through machine learning models trained to recognize phonemes, words, and language patterns. Speech recognition enables voice assistants, meeting transcription, call center analytics, podcast and video content indexing, and voice-activated creative production tools, making it a core enabling technology for several agency workflow automation applications.

Also known as automatic speech recognition, ASR, speech-to-text

What it is

A working definition of speech recognition.

Automatic speech recognition takes an audio waveform as input and produces a text transcript as output. Modern end-to-end neural ASR systems, including Whisper and similar transformer-based architectures, process audio spectrograms directly through an encoder-decoder model that jointly learns acoustic representations and language modeling. Earlier pipeline-based ASR systems separated acoustic modeling (mapping audio features to phoneme probabilities), lexicon lookup (mapping phoneme sequences to words), and language modeling (ranking word sequences by linguistic plausibility) into distinct components trained separately and combined at inference. End-to-end systems outperform pipeline approaches on most benchmarks because joint training allows the model to learn complementary representations across all three tasks simultaneously.
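As a rough illustration of how a pipeline system combines separately trained components at inference, the sketch below picks the candidate transcript with the best weighted sum of an acoustic log-probability and a language-model log-probability. The function name, the `lm_weight` parameter, and the candidate scores are hypothetical values chosen for illustration; they do not come from any real model.

```python
from typing import List, Tuple

# Each candidate is (text, acoustic log-probability, language-model log-probability).
Candidate = Tuple[str, float, float]


def pick_transcript(candidates: List[Candidate], lm_weight: float = 1.0) -> str:
    """Return the candidate text with the highest combined score.

    Combined score = acoustic log-prob + lm_weight * language-model log-prob.
    """
    best = max(candidates, key=lambda c: c[1] + lm_weight * c[2])
    return best[0]


# Hypothetical scores: the second candidate fits the audio slightly better,
# but the language model finds it much less plausible.
candidates = [
    ("recognize speech", -12.0, -4.0),
    ("wreck a nice beach", -11.5, -9.0),
]

print(pick_transcript(candidates))                 # language model included
print(pick_transcript(candidates, lm_weight=0.0))  # acoustic score only
```

With the language model included, the more plausible word sequence wins even though its acoustic score is slightly worse. An end-to-end model learns this trade-off inside a single network instead of exposing it as a separate weight.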

Word error rate (WER) is the standard ASR evaluation metric. It is the minimum number of word-level edits (substitutions, deletions, and insertions) needed to turn the recognized transcript into the reference transcript, divided by the number of words in the reference. A WER of 5% means roughly 5 word errors for every 100 reference words; because insertions count as errors, WER can exceed 100%. General-purpose ASR models can reach WER below 5% on clean, well-recorded benchmark speech, but performance degrades substantially under harder conditions: accented speech, technical vocabulary, noisy environments, and multi-speaker overlap all increase WER. Models fine-tuned on a domain's vocabulary and acoustic conditions typically achieve lower WER on that domain's content than general models do.
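The WER calculation can be sketched in a few lines of Python. This is a minimal version that assumes whitespace tokenization and lowercasing as the only text normalization; production scoring tools usually also normalize punctuation and number formats. The example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "please move the logo to the top left corner"
hypothesis = "please move the local to top left corner now"

# One substitution (logo -> local), one deletion (the), one insertion (now):
# 3 edits over 9 reference words.
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Because the denominator is the reference length, a hypothesis with many extra words can score above 1.0, which is why WER is not a simple "percentage of words wrong".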

Speaker diarization is a related task that identifies who spoke when, assigning a speaker label to each segment of a transcript. Diarization is required for multi-participant recordings such as client meetings, call center interactions, and focus group sessions where distinguishing between speakers is necessary for analysis. Diarization is evaluated separately from transcription accuracy using diarization error rate (DER): the sum of missed speech, falsely detected speech, and speech attributed to the wrong speaker, divided by the total reference speech time. Diarization quality degrades sharply with overlapping speech, which is common in natural conversation, and in such recordings it is often harder to get right than the transcription itself.
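A simplified, frame-based DER calculation is sketched below. It assumes the audio has been cut into equal-length frames, that each frame carries at most one speaker label (so it does not model overlapping speech), and that hypothesis speaker labels have already been mapped to reference labels. Real scoring tools handle that mapping and overlap themselves; the function name and frame labels here are illustrative.

```python
from typing import Optional, Sequence


def diarization_error_rate(
    reference: Sequence[Optional[str]],
    hypothesis: Sequence[Optional[str]],
) -> float:
    """DER = (missed + false alarm + confusion) / reference speech frames.

    Each element is a speaker label for one frame, or None for non-speech.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same frames")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1      # speech the system did not detect
            elif hyp != ref:
                confusion += 1   # speech credited to the wrong speaker
        elif hyp is not None:
            false_alarm += 1     # system reported speech during silence

    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


ref_frames = ["A", "A", "A", None, "B", "B", "B", "B", None, "A"]
hyp_frames = ["A", "A", "B", None, None, "B", "B", "B", "B", "A"]

# 1 confusion + 1 missed + 1 false alarm over 8 reference speech frames.
print(f"DER: {diarization_error_rate(ref_frames, hyp_frames):.3f}")
```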

Why ad agencies care

Why speech recognition is the enabling technology for meeting intelligence, call analysis, and voice content workflows in agency operations.

A working ad agency that records client calls, briefs, and creative reviews has a valuable but largely unstructured data asset: the spoken record of client preferences, feedback, objections, and decisions that are currently captured only in handwritten notes or not at all. Speech recognition converts this audio record into searchable, analyzable text, enabling systematic extraction of client insights, action item tracking, call quality evaluation, and brief documentation that reduces information loss from verbal-only communication and improves continuity when team members change.

Automated transcription of client calls and briefing sessions reduces post-call documentation time while improving completeness and accuracy of captured information. A client call that produces 4,000 to 6,000 words of spoken content in 45 minutes generates a complete written record in minutes via automated transcription, versus 30 to 60 minutes of manual note-taking that captures only a fraction of the discussion and reflects the note-taker’s interpretive filter rather than the actual words spoken. Automated transcription with post-transcription summary generation (using a language model to extract decisions, action items, and key feedback from the transcript) produces a structured call record that can be stored in the CRM and referenced in future meetings, creating an institutional memory for client relationships that reduces reliance on individual team member recall.

Call center conversation analytics built on speech recognition surfaces patterns in customer objections, satisfaction drivers, and agent performance that call monitoring alone cannot capture at scale. An agency managing call center analytics for a client can apply ASR to a 10 to 30% sample of call recordings (or 100% for automated classification tasks), converting spoken interactions to text and applying NLP analysis to identify recurring objection patterns, customer sentiment trajectories, keywords associated with calls that end in resolution versus escalation, and agent-specific communication patterns that predict customer satisfaction scores. This population-level analysis of call content is only possible at scale through ASR; manual call review is limited to 1 to 3% of calls in most operations.
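One of the analyses mentioned above, finding keywords associated with escalation, can be sketched as a simple per-keyword escalation rate over transcribed calls. The transcripts, outcome labels, and keyword list below are toy data invented for illustration, and a real analysis would also need enough calls per keyword for the rates to be meaningful.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def escalation_rate_by_keyword(
    calls: Iterable[Tuple[str, str]],
    keywords: List[str],
) -> Dict[str, float]:
    """For each keyword, return the share of calls containing it that escalated.

    `calls` yields (transcript, outcome) pairs, where outcome is
    "resolved" or "escalated". Keywords never seen are left out.
    """
    containing: Counter = Counter()
    escalated: Counter = Counter()
    for transcript, outcome in calls:
        words = set(re.findall(r"[a-z']+", transcript.lower()))
        for keyword in keywords:
            if keyword in words:
                containing[keyword] += 1
                if outcome == "escalated":
                    escalated[keyword] += 1
    return {k: escalated[k] / containing[k] for k in keywords if containing[k]}


# Toy transcripts standing in for ASR output.
calls = [
    ("I want a refund for this charge", "escalated"),
    ("thanks the refund arrived today", "resolved"),
    ("please cancel my subscription and refund me", "escalated"),
    ("how do I update my billing address", "resolved"),
    ("I need to cancel the order", "escalated"),
]

rates = escalation_rate_by_keyword(calls, ["refund", "cancel", "billing"])
for keyword, rate in rates.items():
    print(f"{keyword}: {rate:.2f}")
```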

Video and podcast content transcription enables SEO-optimized content extraction and searchable archives for media clients without manual captioning cost. A media client with an archive of 400 video episodes and a production rate of 4 new videos per week faces a captioning and transcript cost of $1.20 to $2.00 per minute through professional captioning services. ASR for content with standard speech quality costs $0.01 to $0.06 per minute through API-based services, reducing captioning cost by 95% or more. The transcripts enable full-text search of video content, automatic subtitle generation, content repurposing into blog posts or newsletters, and speaker-attributed quote extraction, none of which is economically feasible at commercial captioning rates but all of which are routine at ASR rates.
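The cost comparison can be checked with a short calculation. The per-minute rates and the 400-episode archive come from the text; the 20-minute episode length is a hypothetical assumption added only so that dollar totals can be shown. The savings percentage does not depend on it.

```python
from typing import Tuple

PROFESSIONAL_RATE = (1.20, 2.00)  # USD per minute (low, high), from the text
ASR_RATE = (0.01, 0.06)           # USD per minute (low, high), from the text
ARCHIVE_EPISODES = 400            # from the text
ASSUMED_MINUTES_PER_EPISODE = 20  # hypothetical; not stated in the text


def cost_range(minutes: float, rate: Tuple[float, float]) -> Tuple[float, float]:
    """Return (lowest, highest) cost for the given number of minutes."""
    low, high = rate
    return minutes * low, minutes * high


def savings_range(
    professional: Tuple[float, float], asr: Tuple[float, float]
) -> Tuple[float, float]:
    """Return (smallest, largest) fractional saving from switching to ASR."""
    smallest = 1 - asr[1] / professional[0]  # priciest ASR vs cheapest captioning
    largest = 1 - asr[0] / professional[1]   # cheapest ASR vs priciest captioning
    return smallest, largest


archive_minutes = ARCHIVE_EPISODES * ASSUMED_MINUTES_PER_EPISODE
pro_low, pro_high = cost_range(archive_minutes, PROFESSIONAL_RATE)
asr_low, asr_high = cost_range(archive_minutes, ASR_RATE)
smallest, largest = savings_range(PROFESSIONAL_RATE, ASR_RATE)

print(f"Professional captioning: ${pro_low:,.0f} to ${pro_high:,.0f}")
print(f"ASR transcription:       ${asr_low:,.0f} to ${asr_high:,.0f}")
print(f"Saving: {smallest:.1%} to {largest:.1%}")
```

Even pairing the most expensive ASR rate with the cheapest professional rate, the saving is 95%, which matches the "95% or more" figure.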

In practice

What speech recognition looks like inside a working ad agency.

An agency wants to build a systematic client feedback intelligence system that extracts actionable creative feedback from recorded creative review calls. The agency conducts 12 to 18 client review calls per week, each 30 to 60 minutes, for a portfolio of 22 active clients. Currently, feedback is captured only through handwritten notes of varying completeness that are not systematically stored or analyzed across clients. The agency implements a three-stage pipeline.

Stage 1: ASR transcription using a cloud API (Whisper large-v3) with speaker diarization, producing timestamped transcripts with speaker labels for each call. The transcription step takes 3 to 5 minutes per hour of audio and costs approximately $0.36 per call at current API pricing.

Stage 2: a language model prompt extracts structured feedback from each transcript, producing JSON output with fields: creative elements praised (list), creative elements criticized (list), requested changes (list with priority ranking), copy feedback (specific phrases flagged), brand voice notes, and next steps with owners.

Stage 3: extracted feedback is stored in a shared database tagged by client, campaign, and call date, enabling cross-call analysis such as recurring objection patterns per client, most commonly flagged creative elements, and comparison of requested changes versus implemented changes.

Over the first quarter of operation, the system processes 184 calls and extracts 2,940 structured feedback items. Cross-call analysis reveals that 8 of 22 clients consistently flag the same copy pattern (superlatives without qualification) as problematic, enabling a proactive writing guideline update that reduces this specific objection by 74% in subsequent reviews. The system also identifies that calls with 3 or more “requested change” items have 2.3 times higher probability of follow-up revision cycles, enabling the agency to flag high-revision-risk reviews for senior creative director attention before revisions are distributed.
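Stages 2 and 3 of this pipeline might look roughly like the sketch below: parse the language model's JSON output into a typed record, then look across calls for criticisms a client raises repeatedly and for calls with three or more requested changes. The JSON key names, the `CallFeedback` class, the function names, and the sample data are all hypothetical; the text specifies the fields but not a schema, and the transcription and language model calls are not shown.

```python
import json
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CallFeedback:
    """One call's extracted feedback, tagged for cross-call analysis."""
    client: str
    campaign: str
    call_date: str
    praised: List[str] = field(default_factory=list)
    criticized: List[str] = field(default_factory=list)
    requested_changes: List[dict] = field(default_factory=list)
    copy_feedback: List[str] = field(default_factory=list)
    brand_voice_notes: str = ""
    next_steps: List[dict] = field(default_factory=list)


def parse_feedback(raw_json: str, client: str, campaign: str, call_date: str) -> CallFeedback:
    """Turn the model's JSON string into a record; missing keys become empty."""
    data = json.loads(raw_json)
    changes = sorted(data.get("requested_changes", []),
                     key=lambda change: change.get("priority", 0))
    return CallFeedback(
        client=client, campaign=campaign, call_date=call_date,
        praised=list(data.get("praised", [])),
        criticized=list(data.get("criticized", [])),
        requested_changes=changes,
        copy_feedback=list(data.get("copy_feedback", [])),
        brand_voice_notes=str(data.get("brand_voice_notes", "")),
        next_steps=list(data.get("next_steps", [])),
    )


def recurring_criticisms(records: List[CallFeedback], min_calls: int = 2) -> Dict[str, List[str]]:
    """Per client, list criticisms that appear in at least `min_calls` calls."""
    calls_per_item: Dict[str, Counter] = defaultdict(Counter)
    for record in records:
        for item in set(record.criticized):  # count each item once per call
            calls_per_item[record.client][item] += 1
    result = {}
    for client, counts in calls_per_item.items():
        repeated = sorted(item for item, n in counts.items() if n >= min_calls)
        if repeated:
            result[client] = repeated
    return result


def is_high_revision_risk(record: CallFeedback, threshold: int = 3) -> bool:
    """Flag calls with `threshold` or more requested changes."""
    return len(record.requested_changes) >= threshold


# Made-up model outputs for two calls with the same client.
first_call = json.dumps({
    "praised": ["color palette"],
    "criticized": ["superlatives without qualification", "busy layout"],
    "requested_changes": [
        {"change": "swap hero image", "priority": 2},
        {"change": "soften headline", "priority": 1},
        {"change": "shorten tagline", "priority": 3},
    ],
    "copy_feedback": ["best in class"],
    "brand_voice_notes": "keep the tone understated",
    "next_steps": [{"task": "revise headline", "owner": "copywriter"}],
})
second_call = json.dumps({
    "criticized": ["superlatives without qualification"],
    "requested_changes": [{"change": "remove 'world-leading'", "priority": 1}],
})

records = [
    parse_feedback(first_call, "Client A", "Spring launch", "2024-03-04"),
    parse_feedback(second_call, "Client A", "Spring launch", "2024-03-18"),
]

print(recurring_criticisms(records))
print([is_high_revision_risk(r) for r in records])
```

Keeping the record as a dataclass rather than raw JSON makes missing fields explicit and gives the cross-call functions one stable shape to work with, whichever model or prompt produced the extraction.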
