Subtitle terms, defined.
File formats, reading-speed metrics, and accessibility types — in plain English.
The most common subtitle file format. Plain text with numbered cues and HH:MM:SS,mmm timestamps.
The W3C web standard for subtitles. Used by HTML5 <track> elements. Like SRT but with dot-separated milliseconds and styling support.
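To show how close the two formats are, here is a minimal Python sketch that turns SRT text into WebVTT by adding the WEBVTT header and swapping the comma before the milliseconds for a dot. The helper name `srt_to_vtt` is made up for this example, and it assumes well-formed input; styling, byte-order marks, and malformed cues are ignored.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)

def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT (hypothetical helper, minimal sketch)."""
    lines = []
    for line in srt_text.strip().splitlines():
        # Only timing lines match, so commas inside cue text are left alone.
        lines.append(_TIMING.sub(r"\1.\2 --> \3.\4", line))
    # A WebVTT file starts with the WEBVTT header followed by a blank line.
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"

sample = "1\n00:00:01,000 --> 00:00:03,500\nHello, world.\n"
print(srt_to_vtt(sample))
```

The SRT cue numbers are kept as they are, since WebVTT accepts them as optional cue identifiers.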
A heavily-styled subtitle format used by Aegisub and the anime fansub community. Supports per-cue fonts, colors, positioning, and karaoke timing.
A simple subtitle format used by the YouTube uploader. Like SRT but with no cue numbers: each cue's start and end times share one line, separated by a comma, with dot-separated milliseconds (H:MM:SS.mmm,H:MM:SS.mmm).
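As a rough sketch of the difference from SRT, the Python below converts SBV text to SRT. It assumes the usual SBV layout: a `H:MM:SS.mmm,H:MM:SS.mmm` timing line, then the cue text, with blank lines between cues. The function name `sbv_to_srt` is hypothetical.

```python
import re

# One SBV timing line: start and end separated by a comma, dots before ms.
_SBV_TIMING = re.compile(
    r"^(\d+):(\d{2}):(\d{2})\.(\d{3}),(\d+):(\d{2}):(\d{2})\.(\d{3})$"
)

def sbv_to_srt(sbv_text: str) -> str:
    """Convert SBV text to SRT (hypothetical helper, minimal sketch)."""
    # Cues are separated by one or more blank lines.
    blocks = [b for b in re.split(r"\n\s*\n", sbv_text.strip()) if b.strip()]
    out = []
    for number, block in enumerate(blocks, start=1):
        timing, *text = block.splitlines()
        m = _SBV_TIMING.match(timing.strip())
        if not m:
            raise ValueError(f"bad SBV timing line: {timing!r}")
        h1, m1, s1, ms1, h2, m2, s2, ms2 = m.groups()
        # SRT wants two-digit hours and a comma before the milliseconds.
        start = f"{int(h1):02d}:{m1}:{s1},{ms1}"
        end = f"{int(h2):02d}:{m2}:{s2},{ms2}"
        out.append(f"{number}\n{start} --> {end}\n" + "\n".join(text))
    return "\n\n".join(out) + "\n"

sbv = "0:00:01.000,0:00:03.500\nHello there\n\n0:00:04.000,0:00:05.250\nSecond cue\n"
print(sbv_to_srt(sbv))
```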
An XML-based subtitle format used by streaming services and broadcast workflows. Powerful styling and positioning, but verbose.
An older subtitle profile of TTML used by Netflix and Adobe Flash. Internally just TTML XML with a .dfxp extension.
A subtitle-like format used for synchronized song lyrics. Plain text with [mm:ss.xx] timestamps before each line.
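A short Python sketch of reading those timestamps follows. It assumes one `[mm:ss.xx]` tag at the start of each lyric line, with `xx` as hundredths of a second. Metadata tags such as `[ti:...]` are skipped, and the name `parse_lrc` is made up for illustration.

```python
import re

# "[mm:ss.xx]lyric text": minutes, seconds, hundredths of a second.
_LRC_LINE = re.compile(r"\[(\d+):(\d{2})\.(\d{2})\](.*)")

def parse_lrc(text: str) -> list[tuple[float, str]]:
    """Return (start_seconds, lyric) pairs; untimed lines are skipped."""
    cues = []
    for line in text.splitlines():
        m = _LRC_LINE.match(line.strip())
        if m:
            minutes, seconds, hundredths, lyric = m.groups()
            start = int(minutes) * 60 + int(seconds) + int(hundredths) / 100
            cues.append((start, lyric.strip()))
    return cues

lrc = "[ti:Example]\n[00:12.50]First line\n[01:02.03]Second line"
for start, lyric in parse_lrc(lrc):
    print(f"{start:7.2f}s  {lyric}")
```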
An MP4 with captions permanently baked into the pixel data — always visible on every platform, no toggle needed. The standard export format for TikTok, Reels, Shorts, and X where separate caption tracks are unreliable or invisible.
Apple's TTML 1.0 profile for subtitles in Final Cut Pro and iTunes Connect (Apple TV+). A strict XML format with limited per-element styling.
The modern W3C streaming profile of TTML used by Netflix, Apple TV+, Disney+, and most streaming services for caption delivery.
A synchronized lyrics format used by music players (Spotify, Apple Music, Musixmatch) and karaoke apps. Each line is prefixed with a [mm:ss.xx] timestamp.
Subtitles that include speaker labels and non-speech audio cues such as [music] and [door slams], so deaf and hard-of-hearing viewers get the full experience.
Captions that the viewer can toggle on or off, typically delivered as a separate text track encoded into or alongside the video.
Captions burned permanently into the video frame so every viewer sees them. The opposite of closed captions, which can be toggled.
Subtitles that appear automatically only when needed — typically for foreign-language dialogue, on-screen signs, or burned-in graphics in an otherwise same-language video.
Hardcoded subtitles (also called burned-in or open captions) are permanently rendered into the video pixels — identical on every player and platform, but viewers can't turn them off. Required for TikTok, Reels, and Shorts where caption tracks are hidden during feed autoplay.
Subtitles rasterized into the video pixels during export. Identical on every platform, but viewers can't toggle them off.
The process of identifying which speaker said which words in an audio recording. Critical for interviews, podcasts, and any multi-speaker content.
Timing data that marks the start and end of every word, not just every cue. The foundation for karaoke captions and word-by-word reveal animations.
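One way to picture word-level timing is as a list of words, each with its own start and end time. The Python sketch below looks up which word is being spoken at a given playback time, which is the basic operation behind a word-by-word highlight. The `Word` class and `active_word` function are hypothetical names, and the sketch assumes the words are sorted by start time and do not overlap.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

def active_word(words: list[Word], t: float) -> Optional[Word]:
    """Return the word being spoken at time t, or None during a pause."""
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word that started at or before t
    if i >= 0 and t < words[i].end:
        return words[i]
    return None

words = [Word("Hello", 0.0, 0.4), Word("world", 0.5, 0.9)]
print(active_word(words, 0.2))   # inside "Hello"
print(active_word(words, 0.45))  # in the gap between the two words
```

A real player would build the `starts` list once rather than on every lookup; it is rebuilt here only to keep the example short.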
The process of aligning a known transcript to audio to produce precise word-level timestamps. Used to upgrade sentence-level timing to word-level.
Software that converts spoken audio into text. Whisper, AssemblyAI, Deepgram, and Google Speech-to-Text are all ASR engines.
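The standard accuracy metric for ASR. Counts the word-level errors (substitutions, insertions, deletions) and divides by the number of words in the reference transcript. Lower is better, and many insertions can push it above 100%.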
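The calculation can be sketched in a few lines of Python: a word-level edit distance between the reference and the ASR output, divided by the reference length. The function name `word_error_rate` is illustrative, and splitting on whitespace stands in for the text normalization (case, punctuation) that real evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.1%}")
```

In this example there is one substitution ("sat" became "sit") and one deletion (the second "the"), so two errors against six reference words.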
Subtitles where the existing text is automatically time-aligned to the audio, instead of being transcribed from scratch.
Animated text that emphasizes spoken words through motion, scale, color, or position. Common in viral short-form captions.
Captions that highlight each word as it's spoken, syncing color or emphasis to the audio in real time.
A text prefix or color cue that identifies who is speaking. Used in interviews, podcasts, and SDH captions.
The W3C-published accessibility standard. Captions for prerecorded video are required at WCAG 2.1 Level A.
The US federal accessibility standard. Requires captions and transcripts on video content that federal agencies develop, procure, maintain, or use.
A separate narration track that describes visual events on screen for blind and low-vision viewers. WCAG 2.1 Level AA requires it for prerecorded video.
A font feature where every digit has the same width so numbers don't shift when they change — important for timecode displays, live counters, and subtitle timing.
The spacing adjustment between specific letter pairs. Affects subtitle legibility, especially for thin display fonts and ALL-CAPS text.
CEA-608 (Line 21) is the FCC-mandated closed-caption standard that originated with analog US broadcast TV and cable. Still required in MP4/MXF deliverables for ABC, NBC, CBS, FOX, and cable operators in 2026. CEA-708 is its digital-television successor, and most deliverable specs require both.
The W3C profile of TTML used by Netflix, Apple, and most streaming services. Combines TTML's styling with HLS/DASH delivery.