WER
Word error rate
The standard accuracy metric for ASR. Measures word-level transcription errors (substitutions, insertions, deletions) as a percentage of the words in the reference transcript.
In depth
Word error rate (WER) is the number of word-level substitutions, insertions, and deletions needed to turn the ASR output into the reference transcript, divided by the reference word count: WER = (S + D + I) / N. A WER of 5% means roughly 1 error for every 20 reference words. Because insertions are counted against a fixed reference length, WER can exceed 100% on very poor output. State-of-the-art ASR hits 4–8% WER on clean English studio audio in 2026. Real-world podcast and field audio typically lands at 10–15%. Heavy accents or domain-specific vocabulary can push WER above 25% even on top models.
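The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a production scorer: the function name `word_error_rate` is hypothetical, and it assumes lowercasing plus whitespace tokenization, whereas real evaluations usually also normalize punctuation, numbers, and spelling variants before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (S + D + I) / N using word-level edit distance.

    Assumption: text is lowercased and split on whitespace only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = minimum edits to turn the first i-1 reference words
    # into the first j hypothesis words (row-by-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    # Divide by the reference length, not the hypothesis length.
    return prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog"
    hyp = "the quick brown fox jumped over lazy dog"
    # One substitution (jumps -> jumped) and one deletion (the): 2 / 9
    print(f"{word_error_rate(ref, hyp):.1%}")
```

Because the denominator is the reference length, a hypothesis with many extra words can score above 100%: a one-word reference transcribed as four words yields three insertions and a WER of 300%.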
When to use it
Use WER to compare ASR vendors or to set quality SLAs. For caption authoring, the practical proxy is 'how many manual edits per minute of video' — a 5% WER often translates to 1–2 fixes per minute.
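As a rough sketch of the vendor comparison described above, the snippet below pools error counts across a test set and checks each vendor against an SLA threshold. Everything here is an assumption for illustration: the names `pooled_wer` and `rank_vendors`, the vendor labels, the 8% threshold, and the per-file counts are all made up, and the error counts are assumed to come from whatever WER scorer you already use.

```python
from typing import Dict, List, Tuple

# Each entry is (word_errors, reference_words) for one audio file.
FileScore = Tuple[int, int]


def pooled_wer(per_file: List[FileScore]) -> float:
    """Corpus-level WER: total errors divided by total reference words."""
    total_errors = sum(errors for errors, _ in per_file)
    total_words = sum(words for _, words in per_file)
    if total_words == 0:
        raise ValueError("test set contains no reference words")
    return total_errors / total_words


def rank_vendors(
    results: Dict[str, List[FileScore]], sla_wer: float
) -> List[Tuple[str, float, bool]]:
    """Return (vendor, pooled WER, meets SLA) sorted from best to worst."""
    scored = [(name, pooled_wer(files)) for name, files in results.items()]
    scored.sort(key=lambda item: item[1])
    return [(name, wer, wer <= sla_wer) for name, wer in scored]


if __name__ == "__main__":
    # Hypothetical counts for the same two files scored against two vendors.
    results = {
        "vendor_a": [(12, 300), (40, 500)],
        "vendor_b": [(30, 300), (45, 500)],
    }
    for name, wer, ok in rank_vendors(results, sla_wer=0.08):
        print(f"{name}: {wer:.1%} {'meets' if ok else 'misses'} SLA")
```

Pooling errors over total reference words, rather than averaging per-file percentages, keeps a handful of very short files from dominating the result. For a fair comparison, every vendor should be scored on the same audio and the same reference transcripts.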
Frequently asked
What WER counts as 'good'?
Under 8% on clean English audio is competitive in 2026. Under 5% is best-in-class. Domain-specific tuning (medical, legal) can push specialized providers below 3%.
Why does my podcast WER look worse than the marketing claims?
Vendor benchmarks are usually run on clean, read-speech corpora (LibriSpeech, Common Voice). Podcast audio with two speakers, occasional cross-talk, and background music routinely doubles or triples the published WER.