Subtitle terms, defined.
File formats, reading-speed metrics, and accessibility types — in plain English.
The most common subtitle file format. Plain text with numbered cues and HH:MM:SS,mmm timestamps.
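To make the layout above concrete, here is a minimal Python sketch that parses one SRT-style cue into milliseconds. The helper names (`parse_timestamp`, `parse_cue`) and the sample cue are invented for illustration, and the sketch assumes well-formed input with the usual `-->` separator between start and end times.

```python
import re

# Matches the HH:MM:SS,mmm timestamp shape described above.
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def parse_timestamp(text: str) -> int:
    """Convert an HH:MM:SS,mmm timestamp to milliseconds."""
    match = TIMESTAMP.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {text!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_cue(block: str) -> dict:
    """Parse one cue: a number line, a timing line, then the text lines."""
    lines = block.strip().splitlines()
    start, end = lines[1].split("-->")
    return {
        "index": int(lines[0]),
        "start_ms": parse_timestamp(start),
        "end_ms": parse_timestamp(end),
        "text": "\n".join(lines[2:]),
    }


# Hypothetical sample cue, not taken from any real file.
sample = """1
00:00:01,500 --> 00:00:04,000
Hello, world."""

cue = parse_cue(sample)
print(cue["index"], cue["start_ms"], cue["end_ms"], cue["text"])
```

A real parser would also need to handle blank lines between cues, byte-order marks, and malformed timing lines, which this sketch leaves out.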
The W3C web standard for subtitles. Used by HTML5 <track> elements. Like SRT but with dot-separated milliseconds and styling support.
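Because the timestamp difference is so small, simple SRT files can often be converted to WebVTT mechanically. The Python sketch below is one way to do it under stated assumptions: the function name `srt_to_vtt` is hypothetical, the input is assumed to be plain SRT without styling tags, and only timing lines are rewritten.

```python
import re

# An SRT timing line such as "00:00:01,500 --> 00:00:04,000".
TIMING_LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})$"
)


def srt_to_vtt(srt_text: str) -> str:
    """Return WebVTT text for simple SRT input.

    Adds the required "WEBVTT" header line and swaps the comma before the
    milliseconds for a dot. Commas inside cue text are left untouched
    because only whole timing lines match the pattern.
    """
    out = ["WEBVTT", ""]
    for line in srt_text.strip().splitlines():
        out.append(TIMING_LINE.sub(r"\1.\2 --> \3.\4", line.strip()))
    return "\n".join(out) + "\n"


# Hypothetical sample input.
print(srt_to_vtt("1\n00:00:01,500 --> 00:00:04,000\nHello, world.\n"))
```

The SRT cue numbers are kept as-is, since WebVTT accepts an optional identifier line before each timing line.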
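A heavily styled subtitle format used by Aegisub and the anime fansub community. Supports per-cue fonts, colors, positioning, and karaoke timing.
A simple subtitle format used by the YouTube uploader. Like SRT but with no cue numbers: each cue starts with its start and end times on one line, separated by a comma instead of an arrow, and the milliseconds follow a dot (for example 0:00:01.500,0:00:04.000).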
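As a rough sketch of how this format (YouTube's SBV) differs from SRT in practice, the Python below parses one cue whose timing line holds the start and end times separated by a comma. The function names and the sample cue are made up for this example, and the timestamp shape `H:MM:SS.mmm` is assumed.

```python
def parse_sbv_time(text: str) -> int:
    """Convert an H:MM:SS.mmm timestamp to milliseconds."""
    hours, minutes, seconds = text.strip().split(":")
    whole, _, frac = seconds.partition(".")
    # Pad or trim the fractional part to exactly three digits.
    millis = int(frac.ljust(3, "0")[:3]) if frac else 0
    return ((int(hours) * 60 + int(minutes)) * 60 + int(whole)) * 1000 + millis


def parse_sbv_cue(block: str) -> dict:
    """Parse one cue: a 'start,end' timing line followed by text lines."""
    lines = block.strip().splitlines()
    start, end = lines[0].split(",")
    return {
        "start_ms": parse_sbv_time(start),
        "end_ms": parse_sbv_time(end),
        "text": "\n".join(lines[1:]),
    }


# Hypothetical sample cue. Note there is no cue number line.
sbv_cue = parse_sbv_cue("0:00:01.500,0:00:04.000\nHello, world.")
print(sbv_cue)
```

Splitting the timing line on the comma is safe here only because the timestamps themselves use a dot before the milliseconds.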
An XML-based subtitle format used by streaming services and broadcast workflows. Powerful styling and positioning, but verbose.
An older subtitle profile of TTML used by Netflix and Adobe Flash. Internally just TTML XML with a .dfxp extension.
A subtitle-like format used for synchronized song lyrics. Plain text with [mm:ss.xx] timestamps before each line.
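The `[mm:ss.xx]` prefix in this format, commonly known as LRC, is easy to read programmatically. The following Python sketch parses a single line into a start time and its text. The name `parse_lrc_line` and the sample lines are illustrative assumptions, and lines that do not begin with a timestamp in that exact shape are simply skipped.

```python
import re
from typing import Optional, Tuple

# [mm:ss.xx] followed by the lyric text, where xx is hundredths of a second.
LRC_LINE = re.compile(r"^\[(\d{2}):(\d{2})\.(\d{2})\](.*)$")


def parse_lrc_line(line: str) -> Optional[Tuple[int, str]]:
    """Return (start_ms, text) for a timestamped line, or None otherwise."""
    match = LRC_LINE.match(line.strip())
    if match is None:
        return None  # blank line or a line without a [mm:ss.xx] timestamp
    minutes, seconds, hundredths, text = match.groups()
    start_ms = (int(minutes) * 60 + int(seconds)) * 1000 + int(hundredths) * 10
    return start_ms, text.strip()


# Hypothetical sample lines.
print(parse_lrc_line("[01:15.30] First line of the chorus"))
print(parse_lrc_line("not a timestamped line"))
```

Each line carries only a start time, so a player typically shows a line until the next line's timestamp is reached.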
An MP4 video file with subtitles permanently rendered into the pixel data, not as a separate caption track.
Apple's TTML 1.0 profile for subtitles in Final Cut Pro and iTunes Connect (Apple TV+). A strict XML format with limited per-element styling.
The modern W3C streaming profile of TTML used by Netflix, Apple TV+, Disney+, and most streaming services for caption delivery.
A synchronized lyrics format used by music players (Spotify, Apple Music, Musixmatch) and karaoke apps. Each line carries a timestamp in [mm:ss.xx] format.
Subtitles that include speaker labels and non-speech audio cues such as [music] and [door slams], so that deaf and hard-of-hearing viewers get the full experience.
Captions that the viewer can toggle on or off, typically delivered as a separate text track encoded into or alongside the video.
Captions burned permanently into the video frame so every viewer sees them. The opposite of closed captions, which can be toggled.
Subtitles that appear automatically only when needed — typically for foreign-language dialogue, on-screen signs, or burned-in graphics in an otherwise same-language video.
Subtitles permanently rendered into the video frame. They can't be turned off, but they look identical on every player and platform.
Subtitles rasterized into the video pixels during export. Identical on every platform, but viewers can't toggle them off.
The process of identifying which speaker said which words in an audio recording. Critical for interviews, podcasts, and any multi-speaker content.
Timing data that marks the start and end of every word, not just every cue. The foundation for karaoke captions and word-by-word reveal animations.
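One way to picture word-level timing is as a list of words that each carry their own start and end time, from which cue-level timing can always be derived. The Python sketch below groups such words into short cues. The `Word` class, the `group_into_cues` function, and the sample timings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def group_into_cues(words: List[Word], max_words: int = 3) -> List[Tuple[float, float, str]]:
    """Group consecutive words into cues of at most max_words words.

    Each cue starts when its first word starts and ends when its last
    word ends, so cue timing falls out of the word timing.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(word.text for word in chunk)
        cues.append((chunk[0].start, chunk[-1].end, text))
    return cues


# Hypothetical timings, for illustration only.
words = [
    Word("The", 0.00, 0.35),
    Word("quick", 0.35, 0.60),
    Word("brown", 0.60, 0.85),
    Word("fox", 0.85, 1.00),
    Word("jumps", 1.00, 1.40),
]

cues = group_into_cues(words)
print(cues)
```

Going the other way, from cue timing to word timing, is not possible without re-examining the audio, which is why word-level data is the more useful starting point.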
The process of aligning a known transcript to audio to produce precise word-level timestamps. Used to upgrade sentence-level timing to word-level.
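Software that converts spoken audio into text. Whisper, AssemblyAI, Deepgram, and Google Speech-to-Text are all ASR engines.
The standard accuracy metric for ASR. Counts the word-level errors in a transcript (substitutions, insertions, deletions) and expresses them as a percentage of the number of words in the reference transcript. Lower is better, and the value can exceed 100% when the output contains many inserted words.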
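This metric, word error rate (WER), is usually computed as (S + D + I) / N, where S, D, and I are the substitutions, deletions, and insertions in the best word-level alignment and N is the number of words in the reference. A minimal Python sketch follows, assuming whitespace tokenization and no text normalization. The function name `word_error_rate` is made up for this example, and production tools typically normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction (multiply by 100 for a percentage)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance, computed one reference word (row) at a time.
    # prev[j] is the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substitution and one deletion in six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Dividing by the reference length rather than the hypothesis length is what allows the result to go above 1.0 when the hypothesis is much longer than the reference.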
Subtitles where the existing text is automatically time-aligned to the audio, instead of being transcribed from scratch.
Animated text that emphasizes spoken words through motion, scale, color, or position. Common in viral short-form captions.
Captions that highlight each word as it's spoken, syncing color or emphasis to the audio in real time.
A text prefix or color cue that identifies who is speaking. Used in interviews, podcasts, and SDH captions.
The W3C-published accessibility standard. Captions for prerecorded video are required at WCAG 2.1 Level A.
The US federal accessibility standard. Requires captions and transcripts on video content that federal agencies develop, procure, maintain, or use.
A separate narration track that describes visual events on screen for blind and low-vision viewers. WCAG 2.1 Level AA requires it for prerecorded video.
A font feature where every digit has the same width, so numbers don't shift when they change. Important for live counters and timecode displays.
The spacing adjustment between specific letter pairs. Affects subtitle legibility, especially for thin display fonts and ALL-CAPS text.