Speaker diarization (who-spoke-when)
The process of identifying which speaker said which words in an audio recording. Critical for interviews, podcasts, and any multi-speaker content.
In depth
Speaker diarization labels each segment of a transcript with a speaker identity (Speaker 1, Speaker 2…). It runs as a separate step from speech-to-text — typically using a model like pyannote that clusters voice embeddings, then aligns the clusters with the ASR output. Diarization is hard: overlapping speech, similar voices, varying mic distance, and short utterances all trip up modern systems. Even strong pipelines hit diarization error rates of 10–25%.
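The alignment step can be sketched in a few lines. This is a minimal illustration, not how pyannote or any particular pipeline implements it: it assumes the diarizer has already produced speaker turns and the ASR has already produced word timestamps, and it gives each word the speaker whose turn overlaps it the most. The `Turn` and `Word` types, the `assign_speakers` helper, and the sample timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """A diarization segment: one speaker talking from start to end (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass
class Word:
    """An ASR word with its start and end time (seconds)."""
    start: float
    end: float
    text: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns, unknown="Unknown"):
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            ov = overlap(word.start, word.end, turn.start, turn.end)
            if ov > best_overlap:
                best_speaker, best_overlap = turn.speaker, ov
        labeled.append((best_speaker, word.text))
    return labeled


# Made-up example data: two turns and four words.
turns = [Turn(0.0, 2.1, "Speaker 1"), Turn(2.1, 4.0, "Speaker 2")]
words = [
    Word(0.2, 0.6, "Hello"),
    Word(0.7, 1.9, "there"),
    Word(2.0, 2.5, "Hi"),    # straddles the turn boundary
    Word(2.6, 3.0, "back"),
]
labeled = assign_speakers(words, turns)
for speaker, text in labeled:
    print(f"{speaker}: {text}")
```

A word that straddles a turn boundary goes to whichever side covers more of it, which is one reason short utterances and overlapping speech are error-prone: a few hundredths of a second can flip the label.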
When to use it
Use diarization for podcasts, interviews, panel discussions, meeting transcripts, and SDH (subtitles for the deaf and hard of hearing), where labeling speakers improves comprehension. Skip it for single-speaker talking-head video, where there is only one voice to label.
Frequently asked
Does Whisper do speaker diarization?
Whisper itself doesn't — it transcribes words and timestamps. Diarization is bolted on by a separate model, most commonly pyannote-audio. WhisperX is a popular pipeline that runs Whisper + pyannote together.
Can it identify named speakers?
Out of the box, no — diarization labels are anonymous (Speaker 1, Speaker 2). Naming them is a separate step: speaker enrollment with reference audio, or manual labeling after the fact.
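One way to picture speaker enrollment: compare an average voice embedding for each anonymous cluster against embeddings computed from known reference recordings, and rename the cluster when the match is close enough. The sketch below assumes those embeddings already exist. The three-dimensional vectors are toy values (real speaker embeddings have hundreds of dimensions), the 0.7 threshold is an arbitrary placeholder that would need tuning, and `name_speakers` is a hypothetical helper, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def name_speakers(cluster_embeddings, enrolled, threshold=0.7):
    """Map anonymous labels to enrolled names when similarity >= threshold.

    Clusters with no sufficiently close reference keep their anonymous label.
    """
    names = {}
    for label, embedding in cluster_embeddings.items():
        best_name, best_score = None, threshold
        for name, reference in enrolled.items():
            score = cosine(embedding, reference)
            if score >= best_score:
                best_name, best_score = name, score
        names[label] = best_name if best_name is not None else label
    return names


# Toy reference embeddings for two enrolled people (assumed, not real data).
enrolled = {"Alice": [1.0, 0.0, 0.0], "Bob": [0.0, 1.0, 0.0]}

# Toy per-cluster embeddings as a diarizer might summarize them.
clusters = {
    "Speaker 1": [0.9, 0.1, 0.0],
    "Speaker 2": [0.1, 0.95, 0.05],
    "Speaker 3": [0.0, 0.0, 1.0],  # nobody enrolled sounds like this
}

names = name_speakers(clusters, enrolled)
print(names)
```

Keeping the anonymous label when nothing clears the threshold is a deliberate choice here: a wrong name in a transcript is usually worse than "Speaker 3".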
How accurate is speaker diarization in 2026?
Diarization Error Rate (DER) of 10–20% is realistic for clean two-speaker audio. Three or more speakers, overlap, and noisy audio push it higher. For broadcast quality, expect human review.
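For reference, DER is conventionally the sum of three error durations (missed speech, false-alarm speech, and speech attributed to the wrong speaker) divided by the total duration of speech in the reference annotation. The sketch below only does that final arithmetic; it assumes the three durations have already been measured by a scoring tool, and the numbers are invented for illustration.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Invented example: 10 minutes of reference speech.
der = diarization_error_rate(
    missed=30.0,        # speech the system failed to detect
    false_alarm=18.0,   # non-speech labeled as speech
    confusion=42.0,     # speech assigned to the wrong speaker
    total_speech=600.0,
)
print(f"DER: {der:.1%}")
```

Because false alarms are counted against the reference speech duration, DER is not capped at 100% on very noisy audio.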
Whisper: OpenAI's open-source automatic speech recognition model. The de facto baseline for AI subtitle generation and the engine behind most modern caption tools.
Word-level timestamps: Timing data that marks the start and end of every word, not just every cue. The foundation for karaoke captions and word-by-word reveal animations.
SDH: Subtitles that include speaker labels and non-speech audio cues like [music], [door slams], so deaf and hard-of-hearing viewers get the full experience.