ASR
Automatic Speech Recognition
Software that converts spoken audio into text. Whisper, AssemblyAI, Deepgram, and Google Speech-to-Text are all ASR engines.
In depth
ASR (Automatic Speech Recognition) is the umbrella term for software that transcribes speech to text. Modern ASR is dominated by transformer-based models (OpenAI Whisper, AssemblyAI's Universal-1, Deepgram Nova, Google Chirp). Quality is measured by word error rate (WER): the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. The best 2025-era models reach roughly 4–8% WER on clean studio audio and 12–20% on noisy or accented speech. For caption generation, getting the words right is not enough: the word-level timestamps must be accurate too.
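As a rough illustration of how a WER figure is computed, the sketch below counts word-level edits with a standard edit-distance table. The function name `word_error_rate` and the sample sentences are made up for this example, and the text normalization (lowercasing and whitespace splitting) is a simplifying assumption; published benchmarks usually apply more careful normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: 10 reference words, one deletion ("the")
# and one substitution ("right" -> "write").
reference = "please turn on the captions for this video right now"
hypothesis = "please turn on captions for this video write now"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.0%}")  # WER: 20%
```

Because insertions count as errors, WER can exceed 100% when the output is much longer than the reference.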
When to use it
Use ASR for any captioning workflow that doesn't require human-grade legal/medical accuracy. Hand-edit the output for proper nouns and technical vocabulary.
Frequently asked
What's the difference between ASR and transcription?
ASR is the technology; transcription is the output. Human transcription also exists: it is slower, more accurate, and more expensive.
Is Whisper the best ASR?
Whisper is the best general-purpose open-weights ASR as of 2026. Commercial APIs (AssemblyAI, Deepgram) sometimes edge it on specific domains. SoCaptions runs Whisper.
Whisper is OpenAI's open-source automatic speech recognition model. It is the de facto baseline for AI subtitle generation and the engine behind most modern caption tools.
Word-level timestamps are timing data that mark the start and end of every word, not just every cue. They are the foundation for karaoke captions and word-by-word reveal animations.
A transcript is a written record of spoken content. Subtitles are timed text overlaid on a video. Same words, different deliverable.
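To make the step from per-word timing to timed subtitles concrete, here is a minimal sketch that groups words into cues and prints them in SRT form. The `Word` class, the helper names, the sample timings, and the `max_words` and `max_gap` thresholds are all hypothetical choices for illustration; each ASR engine returns word timings in its own format, so real output would need to be mapped into a structure like this first.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def group_into_cues(words, max_words=4, max_gap=0.6):
    """Start a new cue when the current one is full or after a pause."""
    cues, current = [], []
    for word in words:
        cue_full = len(current) >= max_words
        long_pause = bool(current) and word.start - current[-1].end > max_gap
        if current and (cue_full or long_pause):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)
    return cues


def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm form used in SRT files."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues):
    """Each cue spans from its first word's start to its last word's end."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = srt_timestamp(cue[0].start)
        end = srt_timestamp(cue[-1].end)
        text = " ".join(w.text for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"


# Made-up word timings standing in for real ASR output.
words = [
    Word("Captions", 0.00, 0.42), Word("should", 0.45, 0.70),
    Word("match", 0.72, 1.01), Word("speech.", 1.03, 1.50),
    Word("Every", 2.40, 2.71), Word("word.", 2.75, 3.10),
]
cues = group_into_cues(words)
print(to_srt(cues))
```

Keeping the per-word timings inside each cue, rather than only the cue boundaries, is what allows a renderer to highlight or reveal words one at a time.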