How Should Captions Work in a Video Editor?
Let's take a deep dive into how captions work in React Video Editor. I'll break down the current implementation and how captions are generated. It's not quite perfect yet, so I'm looking for feedback on how to improve it.
Sam
Creator of RVE
As part of Version 6 of React Video Editor, we're introducing a new captions feature. This is the most requested feature so far, and we're excited to ship it.
That being said, getting captions right is tricky. React Video Editor is designed as a foundation that developers can build on top of. The goal is to give users full control over their video tools without vendor lock-in or constraints on implementation.
So, captions are now part of the editor, but they're launching in beta. There are a lot of different ways to handle captions, and this post is meant to explore those approaches. I'll explain how captions currently work in the codebase and open up the discussion on what could be improved.
What Captions Look Like in RVE
Before we get into the details, here's an example of what captions look like inside React Video Editor.
Uploading and generating captions:
Write your own captions or upload a file from a speech-to-text service.
Editing captions:
Users can now edit the words within captions.
Styling captions:
Choose your own colors and styles for captions.
How Captions Work in React Video Editor
Captions in React Video Editor are structured like other overlays but are specifically designed to handle timed text. Currently, captions can be generated in two ways:
- Manual Input – Users type their text, and captions are generated with estimated timings.
- File Upload – Users upload JSON-based speech recognition data, which generates captions with precise word-level timing.
Regardless of how they're created, captions are fully editable inside the timeline and support custom styling.
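To make the data model concrete, here is a rough sketch of what a caption could look like as a TypeScript type. The field names come from the snippets in this post, but the type definitions themselves are my reconstruction, not copied from the RVE codebase, and `CaptionWord` is a hypothetical name for the word entries.

```typescript
// Hypothetical shapes inferred from the fields used in this post's snippets.
interface CaptionWord {
  word: string;
  startMs: number; // when the word starts, in milliseconds
  endMs: number; // when the word ends, in milliseconds
  confidence: number; // recognition confidence between 0 and 1
}

interface Caption {
  text: string; // full text of the caption segment
  startMs: number; // start of the segment, in milliseconds
  endMs: number; // end of the segment, in milliseconds
  timestampMs: number | null; // always null in the snippets in this post
  confidence: number; // confidence for the segment as a whole
  words: CaptionWord[]; // word-level timing inside the segment
}

// Example value: one caption holding two timed words.
const exampleCaption: Caption = {
  text: "Hello world",
  startMs: 0,
  endMs: 1100,
  timestampMs: null,
  confidence: 0.965,
  words: [
    { word: "Hello", startMs: 0, endMs: 500, confidence: 0.98 },
    { word: "world", startMs: 600, endMs: 1100, confidence: 0.95 },
  ],
};
```

Both generation methods produce this same shape, which is what lets the timeline treat manual and uploaded captions identically.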
1. Generating Captions via Manual Input
The first method lets users enter text manually. RVE then estimates timing from an average reading speed: 160 words per minute (375 ms per word), with a 500 ms pause between sentences.
    // `script` holds the text typed by the user and is defined outside this excerpt
    const generateCaptions = () => {
      // Split text into sentences using punctuation
      const sentences = script
        .split(/[.!?]+/)
        .map((sentence) => sentence.trim())
        .filter((sentence) => sentence.length > 0);

      // Calculate timing based on average reading speed
      const wordsPerMinute = 160;
      const msPerWord = (60 * 1000) / wordsPerMinute;
      let currentStartTime = 0;

      const processedCaptions: Caption[] = sentences.map((sentence) => {
        const words = sentence.split(/\s+/);
        const sentenceStartTime = currentStartTime;

        // Create timing for each word
        const processedWords = words.map((word, index) => ({
          word,
          startMs: sentenceStartTime + index * msPerWord,
          endMs: sentenceStartTime + (index + 1) * msPerWord,
          confidence: 0.99,
        }));

        // Create caption segment
        const caption: Caption = {
          text: sentence,
          startMs: sentenceStartTime,
          endMs: sentenceStartTime + words.length * msPerWord,
          timestampMs: null,
          confidence: 0.99,
          words: processedWords,
        };

        // Add gap between sentences
        currentStartTime = caption.endMs + 500;
        return caption;
      });

      // Return the result so the computed captions are not discarded
      return processedCaptions;
    };
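To see what those numbers mean in practice, here is a minimal worked example of the same arithmetic, reduced to sentence-level windows. `estimateSentenceWindows` and `SentenceWindow` are hypothetical names for this sketch and are not part of RVE. The reading speed and gap are parameters here so they are easy to experiment with.

```typescript
// Sentence-level timing only; word-level timings follow the same arithmetic.
interface SentenceWindow {
  text: string;
  startMs: number;
  endMs: number;
}

function estimateSentenceWindows(
  script: string,
  wordsPerMinute = 160,
  gapMs = 500,
): SentenceWindow[] {
  const msPerWord = (60 * 1000) / wordsPerMinute; // 375 ms at 160 wpm
  let cursorMs = 0; // where the next sentence starts

  return script
    .split(/[.!?]+/)
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0)
    .map((text) => {
      const wordCount = text.split(/\s+/).length;
      const startMs = cursorMs;
      const endMs = startMs + wordCount * msPerWord;
      cursorMs = endMs + gapMs; // leave a pause before the next sentence
      return { text, startMs, endMs };
    });
}

const windows = estimateSentenceWindows("Hello world. How are you?");
// "Hello world" spans 0 to 750 ms (2 words at 375 ms each).
// "How are you" spans 1250 to 2375 ms (500 ms gap, then 3 words).
```

Note that splitting on punctuation also strips it, so the caption text loses its periods and question marks. Whether that is desirable is one of the things worth deciding.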
2. Generating Captions via File Upload
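This method lets users upload pre-generated captions from a speech recognition service. Words are grouped into captions of five, and their timestamps are converted from seconds to milliseconds.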
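    const handleFileUpload = (event: React.ChangeEvent<HTMLInputElement>) => {
      const file = event.target.files?.[0];
      if (!file) return;

      const reader = new FileReader();
      reader.onload = (e) => {
        try {
          const jsonData = JSON.parse(e.target?.result as string) as WordsFileData;
          const processedCaptions: Caption[] = [];

          // Group the transcript into captions of five words each
          for (let i = 0; i < jsonData.words.length; i += 5) {
            const wordChunk = jsonData.words.slice(i, i + 5);
            // Input timestamps are in seconds; captions use milliseconds
            const startMs = wordChunk[0].start * 1000;
            const endMs = wordChunk[wordChunk.length - 1].end * 1000;

            processedCaptions.push({
              text: wordChunk.map((w) => w.word).join(" "),
              startMs,
              endMs,
              timestampMs: null,
              // Average the word confidences for the whole chunk
              confidence:
                wordChunk.reduce((acc, w) => acc + w.confidence, 0) /
                wordChunk.length,
              words: wordChunk.map((w) => ({
                word: w.word,
                startMs: w.start * 1000,
                endMs: w.end * 1000,
                confidence: w.confidence,
              })),
            });
          }
          // processedCaptions is ready here; the step that hands it to the
          // editor is omitted from this excerpt
        } catch (error) {
          // A try block needs a catch; report files that are not valid JSON
          console.error("Could not parse captions file:", error);
        }
      };

      // Start reading the file; without this call, onload never fires
      reader.readAsText(file);
    };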
The uploaded JSON file should follow this structure:
    {
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.5, "confidence": 0.98 },
        { "word": "world", "start": 0.6, "end": 1.1, "confidence": 0.95 }
      ]
    }
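Since the upload handler casts the parsed JSON straight to `WordsFileData`, a malformed file would only fail later, somewhere inside the loop. One possible safeguard is a runtime check before processing. This is a sketch, not what RVE currently does: the interfaces are inferred from the JSON example, `UploadedWord`, `isUploadedWord`, and `isWordsFileData` are hypothetical names, and the rule that `end` must not precede `start` is my own assumption.

```typescript
// Shapes inferred from the JSON example; timestamps are in seconds.
interface UploadedWord {
  word: string;
  start: number;
  end: number;
  confidence: number;
}

interface WordsFileData {
  words: UploadedWord[];
}

// Checks a single entry of the "words" array.
function isUploadedWord(value: unknown): value is UploadedWord {
  if (typeof value !== "object" || value === null) return false;
  const { word, start, end, confidence } = value as Partial<
    Record<keyof UploadedWord, unknown>
  >;
  return (
    typeof word === "string" &&
    typeof start === "number" &&
    typeof end === "number" &&
    typeof confidence === "number" &&
    end >= start // assumption: a word cannot end before it starts
  );
}

// Checks the whole parsed file before it is treated as WordsFileData.
function isWordsFileData(value: unknown): value is WordsFileData {
  if (typeof value !== "object" || value === null) return false;
  const { words } = value as { words?: unknown };
  return Array.isArray(words) && words.every((w) => isUploadedWord(w));
}

const parsed: unknown = JSON.parse(
  '{"words":[{"word":"Hello","start":0.0,"end":0.5,"confidence":0.98}]}',
);
const looksValid = isWordsFileData(parsed);
```

With a guard like this, the handler could show the user a clear error message instead of failing on a missing field.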
Concerns
How should we be generating captions from videos?
Right now, captions aren't automatically generated from video files inside RVE itself. Users either manually type captions or upload a file from an external speech recognition service. But should RVE handle caption generation directly? Some thoughts:
- Should we be estimating timings based on reading speed?
- Should captions be auto-generated using built-in speech-to-text?
- Is word-level timing too much detail? Should captions just be sentence-based?
There are so many ways to approach this, and I'm still not sure which one makes the most sense.
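One thing worth noting on the last question: word-level data can always be collapsed into sentence-based captions, but not the other way around. Here is a minimal sketch of that collapse. It assumes the transcript keeps sentence-ending punctuation attached to words (for example "world."), which not every speech-to-text service guarantees. `TimedWord`, `SentenceCaption`, and `groupIntoSentences` are hypothetical names for this example.

```typescript
interface TimedWord {
  word: string;
  startMs: number;
  endMs: number;
}

interface SentenceCaption {
  text: string;
  startMs: number;
  endMs: number;
  words: TimedWord[];
}

// Groups timed words into captions that end at ., ! or ?
function groupIntoSentences(words: TimedWord[]): SentenceCaption[] {
  const captions: SentenceCaption[] = [];
  let current: TimedWord[] = [];

  // Turns the words collected so far into one caption.
  const flush = () => {
    if (current.length === 0) return;
    captions.push({
      text: current.map((w) => w.word).join(" "),
      startMs: current[0].startMs,
      endMs: current[current.length - 1].endMs,
      words: current,
    });
    current = [];
  };

  for (const w of words) {
    current.push(w);
    if (/[.!?]$/.test(w.word)) flush();
  }
  flush(); // keep trailing words that lack closing punctuation

  return captions;
}

const sentenceCaptions = groupIntoSentences([
  { word: "Hello", startMs: 0, endMs: 500 },
  { word: "world.", startMs: 600, endMs: 1100 },
  { word: "How", startMs: 1500, endMs: 1700 },
  { word: "are", startMs: 1700, endMs: 1900 },
  { word: "you", startMs: 1900, endMs: 2200 },
]);
// Two captions: "Hello world." (0 to 1100 ms) and "How are you" (1500 to 2200 ms)
```

If this holds up, storing word-level timing and choosing the grouping at render time might keep both options open.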
Where I'm at with Version 6
Captions are in beta, and they work. But I don't know if this is how they should work. Should captions be fully manual, automatic, or a mix of both? Should positioning be fully customizable, or should the editor auto-detect placement? Should captions be sentence-based instead of word-by-word?
I need feedback from people actually using React Video Editor to figure this out.
Final Thoughts
Captions are a critical feature, but I don't want to lock into an approach that doesn't match how people actually want to use them. Version 6 is a first step, but I'm leaving things open-ended because I know there's still a lot to refine.
How do you think captions should work in a video editor?
