Word-level timestamps
Timing data that marks the start and end of every word, not just every cue. The foundation for karaoke captions and word-by-word reveal animations.
In depth
Standard subtitle formats (SRT, VTT) store one timestamp pair per cue, usually a chunk of 1–8 words. Word-level timestamps go finer: every word gets its own start and end. This unlocks karaoke-style highlighting, word-by-word reveal animations, accurate click-to-jump in transcript players, and precise text-based editing of audio in tools like Descript. ASS supports word-level timing natively via \k karaoke tags, whose durations are given in centiseconds; VTT supports it via inline cue timestamps such as <00:00:01.500> placed inside the cue text; SRT has no equivalent.
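As a rough illustration of the ASS route, the sketch below turns a list of timed words into the text field of one Dialogue line. The Word class and the to_ass_karaoke helper are hypothetical names made up for this example, and the timings are invented sample data; the only format fact relied on is that \k takes a duration in centiseconds.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One timed word. Field names are illustrative, not a standard schema."""
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def to_ass_karaoke(words: list[Word]) -> str:
    """Return the text field of one ASS Dialogue line with a karaoke tag per word."""
    parts = []
    for i, w in enumerate(words):
        # Each tag's duration runs until the next word begins, so pauses between
        # words are absorbed into the preceding word; the last word uses its own end.
        until = words[i + 1].start if i + 1 < len(words) else w.end
        centiseconds = round((until - w.start) * 100)  # \k durations are centiseconds
        parts.append(f"{{\\k{centiseconds}}}{w.text}")
    return " ".join(parts)


# Invented sample timings, for illustration only.
line = to_ass_karaoke([Word("Hello", 0.00, 0.42), Word("world", 0.50, 0.90)])
print(line)  # {\k50}Hello {\k40}world
```

The Dialogue line's own Start and End fields would then be set to the first word's start and the last word's end, since \k durations are counted from the line's start.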
When to use it
Use word-level timestamps when you're building a karaoke-style caption animation, a transcript-driven editor, or any UI where users interact with words rather than cues. For static delivery captions, segment-level (per-cue) timing is enough.
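For the interactive cases, the core operation is usually "which word is being spoken at playback time t?". One possible way to answer it is a binary search over the word start times, sketched here. The function name word_index_at and the sample timings are assumptions for this example, and the sketch assumes the words are sorted by start time and do not overlap.

```python
from bisect import bisect_right
from typing import Optional


def word_index_at(starts: list[float], ends: list[float], t: float) -> Optional[int]:
    """Return the index of the word being spoken at time t, or None during a pause.

    Assumes starts is sorted ascending and words do not overlap.
    """
    i = bisect_right(starts, t) - 1  # last word whose start is <= t
    if i >= 0 and t < ends[i]:
        return i
    return None


# Invented sample timings (seconds) for three words.
starts = [0.0, 0.5, 1.2]
ends = [0.42, 0.9, 1.6]

print(word_index_at(starts, ends, 0.6))   # 1 (inside the second word)
print(word_index_at(starts, ends, 0.45))  # None (pause between words)
```

Click-to-jump is the reverse lookup: seeking the player to starts[i] when the user clicks word i.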
Frequently asked
How are word-level timestamps generated?
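Many ASR systems can produce them as a by-product of decoding or through a separate alignment pass. Whisper's segment output can be refined to word level with WhisperX or whisper-timestamped. WhisperX runs a phoneme-based forced-alignment model (wav2vec 2.0) over the audio and the Whisper transcript, while whisper-timestamped applies dynamic time warping to Whisper's cross-attention weights.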
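Whichever tool produces the timing, the result usually needs flattening into a single word list before use. The sketch below assumes the result layout returned by the openai-whisper package when transcribe is called with word_timestamps=True: a "segments" list whose entries carry a "words" list with "word", "start" and "end" keys. The flatten_words helper and the sample dictionary are illustrative assumptions, not real model output.

```python
def flatten_words(result: dict) -> list[dict]:
    """Collect every timed word from a Whisper-style result into one flat list."""
    words = []
    for segment in result.get("segments", []):
        # Segments without word timing simply contribute nothing.
        for w in segment.get("words", []):
            words.append({
                "text": w["word"].strip(),  # word strings often carry a leading space
                "start": w["start"],
                "end": w["end"],
            })
    return words


# With openai-whisper installed, a real result could be obtained roughly like this
# ("audio.wav" is a placeholder path):
#   import whisper
#   result = whisper.load_model("base").transcribe("audio.wav", word_timestamps=True)

# Hand-written stand-in with the same shape, for illustration only.
sample_result = {
    "segments": [
        {"start": 0.0, "end": 0.9, "text": " Hello world", "words": [
            {"word": " Hello", "start": 0.0, "end": 0.42},
            {"word": " world", "start": 0.5, "end": 0.9},
        ]},
    ]
}

flat = flatten_words(sample_result)
print(flat)
```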
How accurate are word-level timestamps?
On clean speech, ±50ms is typical. On fast speech, slurred speech, or noisy audio, accuracy degrades. For karaoke captions the human eye tolerates ±100ms, so even rough alignment looks good.
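To check how an alignment compares with these tolerances on your own material, one option is to measure word-start offsets against a hand-corrected reference. This is a minimal sketch under that assumption: the function names are made up for the example, both lists must cover the same words in the same order, and the numbers are invented sample data rather than measurements.

```python
def start_offsets_ms(predicted: list[float], reference: list[float]) -> list[float]:
    """Absolute difference between predicted and reference word starts, in ms."""
    if len(predicted) != len(reference):
        raise ValueError("both lists must describe the same words")
    return [abs(p - r) * 1000.0 for p, r in zip(predicted, reference)]


def share_within(offsets_ms: list[float], tolerance_ms: float = 100.0) -> float:
    """Fraction of words whose offset is within the given tolerance."""
    return sum(o <= tolerance_ms for o in offsets_ms) / len(offsets_ms)


# Invented sample data: word start times in seconds.
predicted = [0.00, 0.52, 1.35]
reference = [0.02, 0.50, 1.20]

offsets = start_offsets_ms(predicted, reference)
print([round(o) for o in offsets])    # [20, 20, 150]
print(share_within(offsets, 100.0))   # two of the three words are within 100 ms
```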
Can I store word-level timestamps in SRT?
Not natively. The common workarounds are: store inline timestamps in the cue text (some tools tolerate this), use VTT with inline cue timestamps, or use ASS karaoke timing. For internal use, a JSON list of words with start and end times is the cleanest.
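A small sketch of two of these options follows: keeping the words as JSON internally, and rendering them into a single VTT cue with an inline timestamp before each word after the first. The helper names vtt_time and to_vtt_cue, the dictionary keys, and the sample timings are all assumptions for illustration.

```python
import json


def vtt_time(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def to_vtt_cue(words: list[dict]) -> str:
    """Render timed words as one WebVTT cue with inline cue timestamps."""
    timing = f"{vtt_time(words[0]['start'])} --> {vtt_time(words[-1]['end'])}"
    # The first word starts with the cue itself, so it needs no inline timestamp.
    pieces = [words[0]["text"]]
    for w in words[1:]:
        pieces.append(f"<{vtt_time(w['start'])}>{w['text']}")
    return timing + "\n" + " ".join(pieces)


# Invented sample data, stored internally as plain JSON.
words = [
    {"text": "Hello", "start": 0.0, "end": 0.42},
    {"text": "world", "start": 0.5, "end": 0.9},
]
stored = json.dumps(words)

print("WEBVTT\n")
print(to_vtt_cue(json.loads(stored)))
# 00:00:00.000 --> 00:00:00.900
# Hello <00:00:00.500>world
```

Note that the inline form keeps only word start times; the JSON copy is what preserves each word's end.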
Whisper: OpenAI's open-source automatic speech recognition model. The de facto baseline for AI subtitle generation and the engine behind most modern caption tools.
ASS: A heavily-styled subtitle format used by Aegisub and the anime fansub community. Supports per-cue fonts, colors, positioning, and karaoke timing.
WebVTT: The W3C web standard for subtitles. Used by HTML5 <track> elements. Like SRT but with dot-separated milliseconds and styling support.