Word-level timestamps

Timing data that marks the start and end of every word, not just every cue. The foundation for karaoke captions and word-by-word reveal animations.

In depth

Standard subtitle formats such as SRT and WebVTT by default store one timestamp pair per cue, and a cue is usually a chunk of 1–8 words. Word-level timestamps go finer: every word gets its own start and end time. This unlocks karaoke-style highlighting, word-by-word reveal animations, accurate click-to-jump in transcript players, and precise text-based editing of audio in tools like Descript. ASS supports word-level timing natively through \k karaoke tags, which carry per-syllable durations in centiseconds. WebVTT supports it through inline cue timestamp tags such as <00:00:01.500> placed inside the cue text (<c> class spans are for styling, not timing). SRT has no equivalent.
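As a rough illustration of the ASS case, the sketch below turns a list of timed words into a single Dialogue line with one \k tag per word. The Word class, the function names, and the "Default" style name are assumptions made for this example. So is the choice to let each word's duration run until the next word starts, which folds pauses into the preceding word.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical container for one timed word; times are in seconds.
    text: str
    start: float
    end: float


def ass_time(seconds: float) -> str:
    """Format seconds as an ASS timestamp, H:MM:SS.cc (centiseconds)."""
    cs = round(seconds * 100)
    h, rem = divmod(cs, 360_000)
    m, rem = divmod(rem, 6_000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"


def to_ass_karaoke(words: list[Word]) -> str:
    """Build one ASS Dialogue line with a \\k tag in front of every word."""
    parts = []
    for i, w in enumerate(words):
        # Assumption: a word's highlight lasts until the next word begins,
        # so silence between words is absorbed by the preceding word.
        next_start = words[i + 1].start if i + 1 < len(words) else w.end
        duration_cs = round(next_start * 100) - round(w.start * 100)
        parts.append(f"{{\\k{duration_cs}}}{w.text}")
    text = " ".join(parts)
    # Fields: Layer,Start,End,Style,Name,MarginL,MarginR,MarginV,Effect,Text
    return (
        f"Dialogue: 0,{ass_time(words[0].start)},{ass_time(words[-1].end)},"
        f"Default,,0,0,0,,{text}"
    )


words = [Word("Hello", 1.0, 1.4), Word("world", 1.5, 2.0)]
print(to_ass_karaoke(words))
# Dialogue: 0,0:00:01.00,0:00:02.00,Default,,0,0,0,,{\k50}Hello {\k50}world
```

A real pipeline would also need the ASS script header and style definitions, which are left out here.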

When to use it

When you're building a karaoke-style caption animation, a transcript-driven editor, or any UI where users interact with words rather than cues. For static delivery captions, segment-level (per-cue) timing is enough.
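For word-level interaction, a player typically needs to answer "which word is being spoken at time t?" on every playback tick. One way to do that, sketched here under the assumption that words are dicts with "text", "start", and "end" keys (a layout chosen for this example) and are sorted by start time, is a binary search over the start times:

```python
from bisect import bisect_right
from typing import Optional


def word_index_at(words: list[dict], t: float) -> Optional[int]:
    """Return the index of the word spoken at time t (seconds), or None.

    Assumes `words` is sorted by "start" and that words do not overlap.
    """
    # In a real player the list of starts would be built once, not per call.
    starts = [w["start"] for w in words]
    i = bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and t < words[i]["end"]:
        return i
    return None  # before the first word, in a pause, or after the last word


words = [
    {"text": "Hello", "start": 1.0, "end": 1.4},
    {"text": "world", "start": 1.5, "end": 2.0},
]
print(word_index_at(words, 1.2))   # 0
print(word_index_at(words, 1.45))  # None (pause between words)
```

Click-to-jump is the reverse lookup: seek the player to words[i]["start"] for the clicked word.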

Frequently asked

How are word-level timestamps generated?

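Some ASR models emit them as a by-product of decoding. Whisper's standard output is segment-level, so word-level timing is usually recovered in a second step. WhisperX runs a separate phoneme-level forced-alignment model (wav2vec 2.0) over Whisper's transcript, while whisper-timestamped derives word boundaries from Whisper's own cross-attention weights using dynamic time warping.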

How accurate are word-level timestamps?

On clean speech, errors of around ±50 ms are typical. Accuracy degrades on fast speech, slurred speech, and noisy audio. For karaoke captions the human eye tolerates roughly ±100 ms, so even rough alignment looks good.
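To check figures like these on your own material, you can compare an aligner's output with a small hand-labelled reference. The sketch below is only illustrative: the function names and the dict layout are assumptions, the timing values are made up for the example, and it assumes both lists contain the same words in the same order.

```python
def start_errors_ms(predicted: list[dict], reference: list[dict]) -> list[float]:
    """Absolute error of each word's start time, in milliseconds."""
    if len(predicted) != len(reference):
        raise ValueError("word lists must have the same length")
    return [
        abs(p["start"] - r["start"]) * 1000
        for p, r in zip(predicted, reference)
    ]


def share_within(errors_ms: list[float], tolerance_ms: float) -> float:
    """Fraction of words whose error is at most tolerance_ms."""
    return sum(e <= tolerance_ms for e in errors_ms) / len(errors_ms)


# Illustrative values only, not measurements.
reference = [{"text": "Hello", "start": 1.00}, {"text": "world", "start": 1.50}]
predicted = [{"text": "Hello", "start": 1.03}, {"text": "world", "start": 1.62}]

errors = start_errors_ms(predicted, reference)
print(share_within(errors, 100))  # 0.5
```

The same comparison can be repeated for end times if the animation depends on them.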

Can I store word-level timestamps in SRT?

Not natively. The common workarounds are to store inline timestamps in the cue text (some tools tolerate this), to use WebVTT with inline cue timestamp tags, or to use ASS karaoke timing. For internal use, JSON is the cleanest option.
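A minimal sketch of the JSON-first approach: keep the words as plain JSON internally and render a delivery format only on export. The key names ("text", "start", "end") and the function names are assumptions for this example, not a standard schema. The export shown here produces a single WebVTT cue with an inline timestamp tag before every word after the first.

```python
import json


def vtt_time(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def to_vtt_cue(words: list[dict]) -> str:
    """Render one cue; later words are preceded by an inline timestamp tag."""
    # The first word needs no tag: it starts when the cue itself starts.
    pieces = [words[0]["text"]]
    for w in words[1:]:
        pieces.append(f"<{vtt_time(w['start'])}>{w['text']}")
    timing = f"{vtt_time(words[0]['start'])} --> {vtt_time(words[-1]['end'])}"
    return timing + "\n" + " ".join(pieces)


# Hypothetical internal representation, stored as JSON.
stored = json.dumps([
    {"text": "Hello", "start": 1.0, "end": 1.4},
    {"text": "world", "start": 1.5, "end": 2.0},
])

words = json.loads(stored)
print("WEBVTT\n")
print(to_vtt_cue(words))
# 00:00:01.000 --> 00:00:02.000
# Hello <00:00:01.500>world
```

Player support for inline timestamp tags varies, so it is worth testing the exported file in the target player.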
