How Should Captions Work in a Video Editor?

As part of Version 6 of React Video Editor, we're introducing a new captions feature. This is the most requested feature so far, and we're excited to ship it.

That being said, getting captions right is tricky. React Video Editor is designed as a foundation that developers can build on top of. The goal is to give users full control over their video tools without vendor lock-in or constraints on implementation.

So, captions are now part of the editor, but they're launching in beta. There are a lot of different ways to handle captions, and this post is meant to explore those approaches. I'll explain how captions currently work in the codebase and open up the discussion on what could be improved.

What Captions Look Like in RVE

Before we get into the details, here's an example of what captions look like inside React Video Editor.

Uploading and generating captions:

Write your own captions or upload a file from a speech-to-text service.

[Image: caption editor interface showing uploading and adding captions]

Editing captions:

Users can now edit the words within captions.

[Image: caption editor interface showing editing captions and text input]

Styling captions:

Choose your own colors and styles for captions.

[Image: caption editor interface showing styling captions]

How Captions Work in React Video Editor

Captions in React Video Editor are structured like other overlays but are specifically designed to handle timed text. Currently, captions can be generated in two ways:

Manual Input – Users type their text, and captions are generated with estimated timings.
File Upload – Users upload JSON-based speech recognition data, which generates captions with precise word-level timing.

Regardless of how they're created, captions are fully editable inside the timeline and support custom styling.
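To make the "timed text" idea concrete, here is a rough sketch of the data shape the snippets in this post appear to use. The field names are inferred from that code rather than copied from the RVE source, and findActiveCaption is a hypothetical helper added only to show how a playhead time maps to a caption.

```typescript
// Assumed shape, inferred from the snippets in this post (not the actual RVE types).
interface CaptionWord {
  word: string;
  startMs: number; // when the word starts, in milliseconds
  endMs: number; // when the word ends, in milliseconds
  confidence: number; // score between 0 and 1 in this post's examples
}

interface Caption {
  text: string;
  startMs: number;
  endMs: number;
  timestampMs: number | null; // both generators in this post set it to null
  confidence: number;
  words: CaptionWord[];
}

const example: Caption = {
  text: "Hello world",
  startMs: 0,
  endMs: 1100,
  timestampMs: null,
  confidence: 0.965,
  words: [
    { word: "Hello", startMs: 0, endMs: 500, confidence: 0.98 },
    { word: "world", startMs: 600, endMs: 1100, confidence: 0.95 },
  ],
};

// Hypothetical helper: return the caption visible at a given playhead time.
// The end time is treated as exclusive so back-to-back captions never overlap.
function findActiveCaption(captions: Caption[], timeMs: number): Caption | null {
  return captions.find((c) => timeMs >= c.startMs && timeMs < c.endMs) ?? null;
}
```

Keeping word timings nested inside each caption means the same structure can drive both sentence-level display and word-by-word highlighting.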

1. Generating Captions via Manual Input

The first method lets users type their text, and RVE generates timings from an estimated reading speed. The current code assumes 160 words per minute (375 ms per word) and inserts a 500 ms pause between sentences.

// `script` is the raw text typed by the user
const generateCaptions = (script: string): Caption[] => {
  // Split text into sentences using punctuation
  const sentences = script
    .split(/[.!?]+/)
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);

  // Calculate timing based on average reading speed
  const wordsPerMinute = 160;
  const msPerWord = (60 * 1000) / wordsPerMinute; // 375 ms
  let currentStartTime = 0;

  const processedCaptions: Caption[] = sentences.map((sentence) => {
    const words = sentence.split(/\s+/);
    const sentenceStartTime = currentStartTime;

    // Give every word an equal slice of time
    const processedWords = words.map((word, index) => ({
      word,
      startMs: sentenceStartTime + index * msPerWord,
      endMs: sentenceStartTime + (index + 1) * msPerWord,
      confidence: 0.99,
    }));

    // Create caption segment
    const caption: Caption = {
      text: sentence,
      startMs: sentenceStartTime,
      endMs: sentenceStartTime + words.length * msPerWord,
      timestampMs: null,
      confidence: 0.99,
      words: processedWords,
    };

    // Add a 500 ms gap between sentences
    currentStartTime = caption.endMs + 500;
    return caption;
  });

  // Return the captions so the caller can add them to the timeline
  return processedCaptions;
};
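To see what those constants mean in practice, here is a small worked example. estimateScriptEndMs is a hypothetical helper, not part of RVE. It reuses the same assumptions (160 words per minute, a 500 ms gap between sentences) to compute when the last caption of a script would end.

```typescript
// Hypothetical helper mirroring the timing constants used in this post.
const WORDS_PER_MINUTE = 160;
const SENTENCE_GAP_MS = 500;

function estimateScriptEndMs(script: string): number {
  const msPerWord = (60 * 1000) / WORDS_PER_MINUTE; // 375 ms per word

  const sentences = script
    .split(/[.!?]+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  let cursorMs = 0; // where the next sentence starts
  let endMs = 0; // where the most recent sentence ends
  for (const sentence of sentences) {
    const wordCount = sentence.split(/\s+/).length;
    endMs = cursorMs + wordCount * msPerWord;
    cursorMs = endMs + SENTENCE_GAP_MS;
  }
  return endMs;
}

// "Hello world" takes 2 * 375 = 750 ms, then a 500 ms gap,
// then "How are you" takes 3 * 375 = 1125 ms.
console.log(estimateScriptEndMs("Hello world. How are you?")); // 2375
```

Because every word gets the same 375 ms regardless of its length, the estimate drifts from real speech quickly, which is part of why this approach is up for discussion.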

2. Generating Captions via File Upload

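The second method lets users upload word-level timings produced by a speech recognition service. The handler below groups the words into captions of five, converts the timestamps from seconds to milliseconds, and averages the word confidences for each caption.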

const handleFileUpload = (event: React.ChangeEvent<HTMLInputElement>) => {
  const file = event.target.files?.[0];
  if (!file) return;

  const reader = new FileReader();
  reader.onload = (e) => {
    try {
      const jsonData = JSON.parse(e.target?.result as string) as WordsFileData;

      const processedCaptions: Caption[] = [];
      for (let i = 0; i < jsonData.words.length; i += 5) {
        const wordChunk = jsonData.words.slice(i, i + 5);
        const startMs = wordChunk[0].start * 1000;
        const endMs = wordChunk[wordChunk.length - 1].end * 1000;

        processedCaptions.push({
          text: wordChunk.map((w) => w.word).join(" "),
          startMs,
          endMs,
          timestampMs: null,
          confidence: wordChunk.reduce((acc, w) => acc + w.confidence, 0) / wordChunk.length,
          words: wordChunk.map((w) => ({
            word: w.word,
            startMs: w.start * 1000,
            endMs: w.end * 1000,
            confidence: w.confidence,
          })),
        });
      }
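    } catch (error) {
      // Malformed JSON or an unexpected shape ends up here
      console.error("Failed to parse captions file:", error);
    }
  };

  // Start reading the file; without this call, onload never fires
  reader.readAsText(file);
};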

The uploaded JSON file should follow this structure, with start and end given in seconds:

{
  "words": [
    {
      "word": "Hello",
      "start": 0.0,
      "end": 0.5,
      "confidence": 0.98
    },
    {
      "word": "world",
      "start": 0.6,
      "end": 1.1,
      "confidence": 0.95
    }
  ]
}
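Since JSON.parse followed by a type assertion performs no runtime check, it may be worth validating the file before processing it. The sketch below is one way to do that. The WordsFileData shape is inferred from the JSON example above, and UploadedWord, isUploadedWord, and isWordsFileData are hypothetical names, not part of RVE.

```typescript
// Assumed shape of the uploaded file, inferred from the JSON example in this post.
interface UploadedWord {
  word: string;
  start: number; // seconds
  end: number; // seconds
  confidence: number;
}

interface WordsFileData {
  words: UploadedWord[];
}

// Hypothetical type guard for a single word entry.
function isUploadedWord(value: unknown): value is UploadedWord {
  if (typeof value !== "object" || value === null) return false;
  const { word, start, end, confidence } = value as Record<string, unknown>;
  return (
    typeof word === "string" &&
    typeof start === "number" &&
    typeof end === "number" &&
    typeof confidence === "number" &&
    end >= start // a word cannot end before it starts
  );
}

// Hypothetical type guard for the whole file.
function isWordsFileData(value: unknown): value is WordsFileData {
  if (typeof value !== "object" || value === null) return false;
  const words = (value as Record<string, unknown>).words;
  return Array.isArray(words) && words.every((w) => isUploadedWord(w));
}

// Usage: parse to unknown first, then narrow with the guard.
const parsed: unknown = JSON.parse(
  '{"words":[{"word":"Hello","start":0.0,"end":0.5,"confidence":0.98}]}'
);
if (isWordsFileData(parsed)) {
  console.log(`Loaded ${parsed.words.length} word(s)`);
}
```

A guard like this lets the upload handler show a clear error for a wrong file format instead of failing partway through the loop.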

Concerns

How should we be generating captions from videos?

Right now, captions aren't automatically generated from video files inside RVE itself. Users either manually type captions or upload a file from an external speech recognition service. But should RVE handle caption generation directly? Some open questions:

Should we be estimating timings based on reading speed?
Should captions be auto-generated using built-in speech-to-text?
Is word-level timing too much detail? Should captions just be sentence-based?
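On the last question, the two options aren't mutually exclusive: sentence-level captions can be derived from word-level timings, but not the other way around. Below is a rough sketch of that derivation, not how RVE works today. It assumes the speech-to-text output keeps sentence punctuation attached to words, which the sample JSON in this post does not show, and groupWordsIntoSentences, TimedWord, and SentenceCaption are hypothetical names.

```typescript
interface TimedWord {
  word: string;
  startMs: number;
  endMs: number;
}

interface SentenceCaption {
  text: string;
  startMs: number;
  endMs: number;
  words: TimedWord[];
}

// Hypothetical sketch: close a caption whenever a word ends in ., ! or ?
// Assumes punctuation is attached to the words in the transcript.
function groupWordsIntoSentences(words: TimedWord[]): SentenceCaption[] {
  const captions: SentenceCaption[] = [];
  let current: TimedWord[] = [];

  const flush = () => {
    if (current.length === 0) return;
    captions.push({
      text: current.map((w) => w.word).join(" "),
      startMs: current[0].startMs,
      endMs: current[current.length - 1].endMs,
      words: current,
    });
    current = [];
  };

  for (const w of words) {
    current.push(w);
    if (/[.!?]$/.test(w.word)) flush();
  }
  flush(); // keep trailing words that have no closing punctuation

  return captions;
}
```

If the transcript has no punctuation, this collapses everything into a single caption, so a fixed chunk size or a pause-length threshold would still be needed as a fallback.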

There are so many ways to approach this, and I'm still not sure which one makes the most sense.

Where I'm at with Version 6

Captions are in beta, and they work. But I don't know if this is how they should work. Should captions be fully manual, automatic, or a mix of both? Should positioning be fully customizable, or should the editor auto-detect placement? Should captions be sentence-based instead of word-by-word?

I need feedback from people actually using React Video Editor to figure this out.

Final Thoughts

Captions are a critical feature, but I don't want to lock into an approach that doesn't match how people actually want to use them. Version 6 is a first step, but I'm leaving things open-ended because I know there's still a lot to refine.

How do you think captions should work in a video editor?