Speaker diarization (who-spoke-when)

The process of identifying which speaker said which words in an audio recording. Critical for interviews, podcasts, and any multi-speaker content.

In depth

Speaker diarization labels each segment of a transcript with a speaker identity (Speaker 1, Speaker 2…). It runs as a separate step from speech-to-text, typically using a model like pyannote that clusters voice embeddings and then aligns the clusters with the timestamps in the ASR output. Diarization is hard: overlapping speech, similar voices, varying mic distance, and short utterances all trip up even modern systems. Strong pipelines still hit diarization error rates of 10–25%.
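The alignment step can be sketched as follows. This is a minimal illustration, not how pyannote or any specific pipeline implements it: the `Word` and `Turn` types, the `assign_speakers` helper, and the sample timings are all hypothetical, and the rule used here (give each word to the speaker turn it overlaps most) is one simple assumption among several possible ones.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One ASR word with start/end times in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


@dataclass
class Turn:
    """One diarization segment: an anonymous speaker label and its time span."""
    speaker: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for w in words:
        best = max(
            turns,
            key=lambda t: overlap(w.start, w.end, t.start, t.end),
            default=None,
        )
        if best is None or overlap(w.start, w.end, best.start, best.end) == 0.0:
            labeled.append((w.text, None))  # no speaker turn covers this word
        else:
            labeled.append((w.text, best.speaker))
    return labeled


# Illustrative timings only, not real model output.
words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("hi", 1.2, 1.5)]
turns = [Turn("Speaker 1", 0.0, 1.0), Turn("Speaker 2", 1.1, 2.0)]
labeled = assign_speakers(words, turns)
print(labeled)
# [('hello', 'Speaker 1'), ('there', 'Speaker 1'), ('hi', 'Speaker 2')]
```

Words that fall in a gap between turns come back with `None`, which makes it easy to flag them for review instead of guessing a speaker.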

When to use it

Use diarization for podcasts, interviews, panel discussions, meeting transcripts, and SDH (subtitles for the deaf and hard of hearing), where labeling speakers improves comprehension. Skip it for single-speaker talking-head video, where there is only one voice to label.

Frequently asked

Does Whisper do speaker diarization?

Whisper itself doesn't — it transcribes words and timestamps. Diarization is bolted on by a separate model, most commonly pyannote-audio. WhisperX is a popular pipeline that runs Whisper + pyannote together.

Can it identify named speakers?

Out of the box, no — diarization labels are anonymous (Speaker 1, Speaker 2). Naming them is a separate step: speaker enrollment with reference audio, or manual labeling after the fact.
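As a rough sketch of the enrollment idea: compare each anonymous cluster's voice embedding against reference embeddings of known speakers and rename the cluster when the match is close enough. Everything here is an assumption for illustration: the `name_speakers` helper, the 0.7 cosine threshold, and the toy 3-dimensional vectors (real embeddings have hundreds of dimensions and come from a speaker-embedding model).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def name_speakers(cluster_embeddings, enrolled, threshold=0.7):
    """Map anonymous labels to enrolled names; keep the label if nothing matches.

    threshold is a hypothetical cutoff; a real system would tune it on data.
    """
    names = {}
    for label, emb in cluster_embeddings.items():
        best_name, best_score = None, threshold
        for name, ref in enrolled.items():
            score = cosine(emb, ref)
            if score >= best_score:
                best_name, best_score = name, score
        names[label] = best_name or label
    return names


# Toy vectors for illustration only.
clusters = {
    "Speaker 1": [0.9, 0.1, 0.0],
    "Speaker 2": [0.0, 0.2, 0.9],
    "Speaker 3": [0.5, 0.5, 0.5],
}
enrolled = {"Alice": [1.0, 0.0, 0.0], "Bob": [0.0, 0.0, 1.0]}
names = name_speakers(clusters, enrolled)
print(names)
# {'Speaker 1': 'Alice', 'Speaker 2': 'Bob', 'Speaker 3': 'Speaker 3'}
```

This sketch does not stop two clusters from matching the same enrolled name; a production system would need to resolve such conflicts, and manual labeling remains the fallback.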

How accurate is speaker diarization in 2026?

A Diarization Error Rate (DER) of 10–20% is realistic for clean two-speaker audio. Three or more speakers, overlapping speech, and noisy audio push it higher. For broadcast-quality output, plan for human review.
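For reference, DER is conventionally the sum of three error durations (missed speech, false-alarm speech, and speech attributed to the wrong speaker) divided by the total duration of reference speech. A small worked example, with a hypothetical helper name and made-up durations chosen only to show the arithmetic:

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Made-up numbers: 5 minutes of reference speech, 36 seconds of errors.
der = diarization_error_rate(
    missed=12.0, false_alarm=6.0, confusion=18.0, total_speech=300.0
)
print(f"{der:.1%}")  # 12.0%
```

Because false alarms count against a denominator of reference speech only, DER can exceed 100% on very noisy audio.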
