
Whisper

OpenAI Whisper (ASR model)

OpenAI's open-source automatic speech recognition model. The de facto baseline for AI subtitle generation and the engine behind most modern caption tools.

In depth

Whisper is an encoder-decoder transformer trained by OpenAI on 680,000 hours of multilingual, multitask supervised audio. It transcribes speech to text in 99 languages, translates non-English speech directly to English, and ships in five sizes (tiny, base, small, medium, large) plus a faster turbo variant. Most modern AI caption tools, including SoCaptions, call Whisper either through OpenAI's API or through a self-hosted runtime such as faster-whisper or whisper.cpp. Word error rates are typically 5–10% on clean English, depending on model size and test set, and are competitive with proprietary engines on accented and multilingual audio.
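Word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the engine's transcript into a reference transcript, divided by the number of words in the reference. The sketch below shows one way to compute it, assuming simple lowercase whitespace tokenization; the function name `word_error_rate` is illustrative, and published benchmarks usually apply heavier text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown fox jumped over the lazy dog today"
print(word_error_rate(reference, hypothesis))
```

One wrong word in a ten-word reference gives a WER of 10%, so a 5–10% error rate corresponds to roughly one mistake every 10 to 20 words.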

When to use it

Use Whisper when you need transcription that handles accents, code-switching, technical vocabulary, and less-resourced languages reasonably well, and when you want to avoid per-minute pricing by self-hosting. It is the default choice for most new caption pipelines.

Frequently asked

Is Whisper free to use?

The code and model weights are released under the MIT license, so you can self-host for the cost of compute. OpenAI's hosted API charges per audio minute. Most consumer caption tools wrap one or the other.

Does Whisper produce word-level timestamps?

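By default, Whisper outputs segment-level timestamps. The reference openai-whisper package can also emit word-level timings when transcription is run with word_timestamps=True, which derives them from the decoder's cross-attention weights; the whisper-timestamped wrapper takes a similar attention-based approach. The WhisperX pipeline instead force-aligns the Whisper transcript against a separate phoneme-level model, which generally gives tighter word boundaries. Most modern caption tools include one of these.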
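Segment-level timestamps are already enough to build a basic subtitle file. The sketch below assumes each segment is a dict with `start` and `end` in seconds and a `text` string, which is the shape openai-whisper returns under `result["segments"]`. The helper names `srt_timestamp` and `segments_to_srt` are hypothetical, not part of any Whisper package.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Turn Whisper-style segments into the text of an SRT file."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        text = seg["text"].strip()  # Whisper text often has a leading space
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


# Hand-written sample segments, for illustration only.
sample = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": " Let's get started."},
]
print(segments_to_srt(sample))
```

Word-level timings matter when captions are highlighted or revealed one word at a time; for plain block captions like these, segment boundaries are usually sufficient.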

Which Whisper model size should I use?

For consumer captions, large-v3 and turbo are the practical choices: both are accurate enough and run fast on a single GPU. tiny and base are too inaccurate for production captions; medium is a reasonable middle ground.
