
WER

Word error rate

The standard accuracy metric for automatic speech recognition (ASR). Measures word-level transcription errors (substitutions, insertions, deletions) as a percentage of the words in the reference transcript.

In depth

Word error rate (WER) is the number of word-level errors in the ASR output, counted as substitutions (S), insertions (I), and deletions (D) against a reference transcript, divided by the number of words in the reference (N): WER = (S + D + I) / N. A WER of 5% means roughly 1 error for every 20 reference words. Because insertions count as errors, WER can exceed 100% when the output contains many extra words. State-of-the-art ASR hits 4–8% WER on clean English studio audio in 2026. Real-world podcast/field audio is typically 10–15%. Heavy accents or domain-specific vocabulary can push WER above 25% even on top models.
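The definition above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's scoring tool: the function name `wer` is hypothetical, and it assumes that lowercasing and splitting on whitespace is adequate normalization. Production scoring usually also normalizes punctuation and number formats, which changes the result.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / N, where N is the reference word count."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only two rows of the table.
    # prev[j] = edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from the output
                curr[j - 1] + 1,     # insertion: extra word in the output
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
print(f"{wer(reference, hypothesis):.1%}")  # one substitution + one deletion over 9 words: 22.2%
```

The denominator is always the reference length, which is why a hypothesis padded with extra words can score above 100%.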

When to use it

Use WER to compare ASR vendors on the same test audio or to set quality SLAs. For caption authoring, the practical proxy is 'how many manual edits per minute of video': a 5% WER often translates to 1–2 fixes per minute.
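For a vendor comparison or an SLA check, one common approach is to pool errors across the whole test set rather than average per-file percentages, so that short clips do not dominate the result. The sketch below assumes the per-file error counts are already known; `FileResult`, `pooled_wer`, `meets_sla`, the vendor figures, and the 5% threshold are all hypothetical and purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class FileResult:
    errors: int     # substitutions + insertions + deletions for one file
    ref_words: int  # word count of that file's reference transcript


def pooled_wer(results: list[FileResult]) -> float:
    """Total errors divided by total reference words across all files."""
    total_words = sum(r.ref_words for r in results)
    if total_words == 0:
        raise ValueError("test set contains no reference words")
    return sum(r.errors for r in results) / total_words


def meets_sla(results: list[FileResult], max_wer: float) -> bool:
    """True when the pooled WER is at or below the agreed threshold."""
    return pooled_wer(results) <= max_wer


# Illustrative numbers only: the same two test files scored for two vendors.
vendor_a = [FileResult(errors=12, ref_words=400), FileResult(errors=30, ref_words=300)]
vendor_b = [FileResult(errors=16, ref_words=400), FileResult(errors=12, ref_words=300)]

for name, results in (("vendor_a", vendor_a), ("vendor_b", vendor_b)):
    print(name, f"{pooled_wer(results):.1%}", "meets 5% SLA:", meets_sla(results, 0.05))
```

Scoring every vendor against the same reference transcripts with the same text normalization keeps the comparison fair.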

Frequently asked

What WER counts as 'good'?

Under 8% on clean English audio is competitive in 2026. Under 5% is best-in-class. Domain-specific tuning (medical, legal) can push specialized providers below 3%.

Why does my podcast WER look worse than the marketing claims?

Vendor benchmarks are usually run on read-speech benchmark corpora (LibriSpeech, Common Voice), where each clip has a single speaker reading prepared text. Podcast audio with two speakers, occasional cross-talk, and background music routinely doubles or triples the published WER.
