Glossary

Subtitle terms, defined.

File formats, reading-speed metrics, and accessibility types — in plain English.

File format

SRT

The most common subtitle file format. Plain text with numbered cues and HH:MM:SS,mmm timestamps.
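
As a rough illustration of the numbered-cue layout, the sketch below parses one SRT cue and converts its HH:MM:SS,mmm timestamps to milliseconds. The sample cue and the helper names (`srt_time_to_ms`, `parse_cue`) are made up for this example, and real files may need more lenient handling.

```python
import re

# One SRT cue: a cue number, a timing line, then one or more text lines.
SAMPLE_CUE = """1
00:00:01,000 --> 00:00:03,500
Hello, world."""

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def srt_time_to_ms(stamp: str) -> int:
    """Convert an HH:MM:SS,mmm timestamp to milliseconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_cue(block: str) -> dict:
    """Split a single cue block into its number, timing, and text."""
    lines = block.strip().splitlines()
    start, end = (srt_time_to_ms(part) for part in lines[1].split("-->"))
    return {
        "index": int(lines[0]),
        "start_ms": start,
        "end_ms": end,
        "text": "\n".join(lines[2:]),
    }


print(parse_cue(SAMPLE_CUE))
```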

WebVTT

The W3C web standard for subtitles. Used by HTML5 <track> elements. Like SRT but with dot-separated milliseconds and styling support.
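
Because the two formats are so close, a basic SRT-to-WebVTT conversion can be sketched in a few lines: add the `WEBVTT` header and swap the comma before the milliseconds for a dot. The function name `srt_to_vtt` is hypothetical, and this sketch ignores styling tags and any malformed cues.

```python
import re

# Matches the HH:MM:SS,mmm timestamps found on SRT timing lines.
SRT_TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT by adding the header and using dots."""
    out = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            # Only timing lines are rewritten; commas in dialogue stay put.
            line = SRT_TIMING.sub(r"\1.\2", line)
        out.append(line)
    return "WEBVTT\n\n" + "\n".join(out) + "\n"


print(srt_to_vtt("1\n00:00:01,000 --> 00:00:03,500\nHello, world.\n"))
```

The SRT cue numbers are left in place, since WebVTT accepts them as optional cue identifiers.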

ASS

A heavily-styled subtitle format used by Aegisub and the anime fansub community. Supports per-cue fonts, colors, positioning, and karaoke timing.

SBV

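A simple subtitle format used by the YouTube uploader. Like SRT but with no cue numbers: each cue's start and end times sit on one line, separated by a comma, with a dot before the milliseconds (H:MM:SS.mmm,H:MM:SS.mmm).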
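
For comparison with SRT, here is a small sketch that writes one SBV cue. It assumes the common H:MM:SS.mmm,H:MM:SS.mmm timing line followed by the cue text. The helper names are hypothetical.

```python
def ms_to_sbv(ms: int) -> str:
    """Format milliseconds as H:MM:SS.mmm (hours are not zero-padded)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def sbv_cue(start_ms: int, end_ms: int, text: str) -> str:
    """Build one SBV cue: a 'start,end' timing line, then the text."""
    return f"{ms_to_sbv(start_ms)},{ms_to_sbv(end_ms)}\n{text}\n"


print(sbv_cue(599, 4160, "Hi, my name is Alice."))
```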

TTML

An XML-based subtitle format used by streaming services and broadcast workflows. Powerful styling and positioning, but verbose.

DFXP

An older name and profile of TTML (Distribution Format Exchange Profile), used by Netflix and Adobe Flash. Internally it is TTML XML with a .dfxp extension.

Burned-in MP4

An MP4 video file with subtitles permanently rendered into the pixel data, not as a separate caption track.

iTT

Apple's TTML 1.0 profile for subtitles in Final Cut Pro and iTunes Connect (Apple TV+). A strict XML format with limited per-element styling.

IMSC

The modern W3C streaming profile of TTML used by Netflix, Apple TV+, Disney+, and most streaming services for caption delivery.

LRC

A synchronized lyrics format used by music players (Spotify, Apple Music, Musixmatch) and karaoke apps. Each line is timestamped with [mm:ss.xx] format.
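
A minimal sketch of reading one LRC line, assuming the simple case of a single [mm:ss.xx] tag followed by the lyric text. Metadata tags and lines with several timestamps are not handled, and `parse_lrc_line` is a hypothetical helper name.

```python
import re

# [mm:ss.xx] followed by the lyric text, where xx is hundredths of a second.
LRC_LINE = re.compile(r"\[(\d{1,2}):(\d{2})\.(\d{2})\](.*)")


def parse_lrc_line(line: str):
    """Return (start_ms, text) for a timed lyric line, or None otherwise."""
    match = LRC_LINE.fullmatch(line.strip())
    if match is None:
        return None
    minutes, seconds, hundredths, text = match.groups()
    start_ms = int(minutes) * 60_000 + int(seconds) * 1000 + int(hundredths) * 10
    return start_ms, text.strip()


print(parse_lrc_line("[01:15.30] Hello again"))
```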

Reading-speed metric

CPS

Characters per second. The reading-speed metric used by professional subtitle standards: the total characters in a cue divided by its duration in seconds.
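
The calculation can be sketched as below. Whether spaces and punctuation count toward the total varies between style guides. This sketch assumes they do and only drops line breaks, and `chars_per_second` is a hypothetical helper name.

```python
def chars_per_second(text: str, start_s: float, end_s: float) -> float:
    """Characters in the cue divided by the cue's duration in seconds."""
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("cue must end after it starts")
    # Assumption: spaces and punctuation count, line breaks do not.
    return len(text.replace("\n", "")) / duration


# 13 characters shown for 2.5 seconds.
print(chars_per_second("Hello, world.", 1.0, 3.5))
```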

WPM

Reading speed expressed in words per minute. Easier to intuit than CPS but less consistent across languages.
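
A comparable sketch for WPM, assuming words are separated by whitespace. That assumption is one reason the metric travels poorly: it breaks down for languages written without spaces and treats long compound words as a single word. The name `words_per_minute` is hypothetical.

```python
def words_per_minute(text: str, start_s: float, end_s: float) -> float:
    """Whitespace-separated words in the cue, scaled to one minute."""
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("cue must end after it starts")
    return len(text.split()) * 60 / duration


# 2 words shown for 2.5 seconds.
print(words_per_minute("Hello, world.", 1.0, 3.5))
```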

Subtitle type

SDH

Subtitles that include speaker labels and non-speech audio cues such as [music] and [door slams], so deaf and hard-of-hearing viewers get the full experience.

Closed captions

Captions that the viewer can toggle on or off, typically delivered as a separate text track encoded into or alongside the video.

Open captions

Captions burned permanently into the video frame so every viewer sees them. The opposite of closed captions, which can be toggled.

Forced subtitles

Subtitles that appear automatically only when needed — typically for foreign-language dialogue, on-screen signs, or burned-in graphics in an otherwise same-language video.

Hardcoded subtitles

Subtitles permanently rendered into the video frame. They can't be turned off, but they look identical on every player and platform.

Burned-in subtitles

Subtitles rasterized into the video pixels during export. Identical on every platform, but viewers can't toggle them off.

Transcription engine

Whisper

OpenAI's open-source automatic speech recognition model. The de facto baseline for AI subtitle generation and the engine behind most modern caption tools.

Transcription concept

Speaker diarization

The process of identifying which speaker said which words in an audio recording. Critical for interviews, podcasts, and any multi-speaker content.

Word-level timestamps

Timing data that marks the start and end of every word, not just every cue. The foundation for karaoke captions and word-by-word reveal animations.

Forced alignment

The process of aligning a known transcript to audio to produce precise word-level timestamps. Used to upgrade sentence-level timing to word-level.

Subtitle concept

Transcript vs subtitles

A transcript is a written record of spoken content. Subtitles are timed text overlaid on a video. Same words, different deliverable.

Video technical

Frame rate

How many video frames are shown per second. Common rates: 23.976, 24, 25, 29.97, 30, 60. Affects subtitle timing precision.
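
To make the timing-precision point concrete: a subtitle can only appear or disappear on a frame boundary, so cue times are effectively rounded to the nearest frame. The sketch below shows the frame duration and that rounding. The helper names are hypothetical.

```python
def frame_duration_ms(fps: float) -> float:
    """Length of one frame in milliseconds."""
    return 1000 / fps


def snap_to_frame(seconds: float, fps: float) -> float:
    """Round a cue time to the nearest frame boundary."""
    frame_index = round(seconds * fps)
    return frame_index / fps


print(frame_duration_ms(25))    # one frame at 25 fps, in milliseconds
print(snap_to_frame(1.03, 25))  # nearest frame boundary to 1.03 s
```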

Drop-frame timecode

A timecode convention for 29.97 fps video that drops 2 frame numbers per minute (except every 10th) to keep timecode aligned with real time.
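
The rule above can be sketched as a conversion from a running frame count to a drop-frame timecode. This is an illustrative version of the commonly used arithmetic for 29.97 fps. The function name `frames_to_dropframe` is hypothetical and the sketch does not wrap at 24 hours.

```python
def frames_to_dropframe(frame_number: int) -> str:
    """Convert a 29.97 fps frame count to HH:MM:SS;FF drop-frame timecode."""
    drop = 2                                   # frame numbers skipped per minute
    frames_per_minute = 30 * 60 - drop         # 1798 in a minute that drops
    frames_per_10_minutes = 30 * 600 - drop * 9  # 17982: nine of ten minutes drop

    tens, rem = divmod(frame_number, frames_per_10_minutes)
    # Add back the skipped numbers so plain 30 fps arithmetic works below.
    frame_number += drop * 9 * tens
    if rem > drop:
        frame_number += drop * ((rem - drop) // frames_per_minute)

    ff = frame_number % 30
    ss = (frame_number // 30) % 60
    mm = (frame_number // 1800) % 60
    hh = frame_number // 108000
    return f"{hh:02d}:{mm:02d}:{ss:02d};{ff:02d}"


print(frames_to_dropframe(1799))   # last frame of the first minute
print(frames_to_dropframe(1800))   # numbers ;00 and ;01 are skipped here
print(frames_to_dropframe(17982))  # tenth minute: nothing is skipped
```

The semicolon before the frame field is the usual way to signal drop-frame timecode.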

AI / transcription

ASR

Automatic speech recognition: software that converts spoken audio into text. Whisper, AssemblyAI, Deepgram, and Google Speech-to-Text are all ASR engines.

WER

Word error rate. The standard accuracy metric for ASR: the number of word substitutions, insertions, and deletions divided by the number of words in the reference transcript, usually shown as a percentage.
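
WER is conventionally computed as the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The sketch below assumes both texts are already normalized (case, punctuation) and splits on whitespace. `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the result can exceed 1.0 when the output is much longer than the reference.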

Force-aligned

Subtitles where the existing text is automatically time-aligned to the audio, instead of being transcribed from scratch.

Caption styling

Kinetic typography

Animated text that emphasizes spoken words through motion, scale, color, or position. Common in viral short-form captions.

Karaoke captions

Captions that highlight each word as it's spoken, syncing color or emphasis to the audio in real time.
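
One way to drive the highlight is to look up which word covers the current playback time. The sketch below assumes word timings arrive as (start, end, text) tuples sorted by start time. That data layout and the name `active_word` are illustrative assumptions, not a fixed format.

```python
from bisect import bisect_right


def active_word(words, t):
    """Return the index of the word being spoken at time t, or None."""
    starts = [start for start, _end, _text in words]
    i = bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and t < words[i][1]:
        return i
    return None  # before the first word, or in a gap between words


words = [(0.0, 0.4, "Hello"), (0.5, 0.9, "world")]
print(active_word(words, 0.6))   # inside the second word
print(active_word(words, 0.45))  # in the gap between the two words
```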

Speaker label

A text prefix or color cue that identifies who is speaking. Used in interviews, podcasts, and SDH captions.

Accessibility

WCAG

The W3C-published accessibility standard. Captions for prerecorded video are required at WCAG 2.1 Level A.

Section 508

The US federal accessibility standard. Requires captions and transcripts on all federal-government-procured video content.

Audio description

A separate narration track that describes visual events on screen for blind and low-vision viewers. WCAG 2.1 Level AA requires it for prerecorded video.

Typography

Tabular figures

A font feature where every digit has the same width, so numbers don't shift when they change. Important for live counters and timecode displays.

Kerning

The spacing adjustment between specific letter pairs. Affects subtitle legibility, especially for thin display fonts and ALL-CAPS text.

Broadcast standard

CEA-608

The original US closed-caption standard for analog NTSC TV. Now used as embedded text tracks inside MP4 / TS broadcast files.

IMSC

The W3C profile of TTML used by Netflix, Apple, and most streaming services. Combines TTML's styling with HLS/DASH delivery.
