Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

📅 2025-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses automatic chapter segmentation for hour-long videos. The authors propose Chapter-Llama, an end-to-end, speech-guided chaptering method that combines timestamped ASR transcripts with lightweight keyframe captions — generated selectively via speech-driven frame filtering to avoid the cost of captioning every frame — and feeds both into a large-context LLM (Llama-based) that predicts chapter boundaries and free-form titles in a single forward pass. Key contributions: (1) an end-to-end chaptering framework that processes a full hour of video in one forward pass; and (2) a speech-guided keyframe selection strategy that improves both computational efficiency and semantic alignment with the audio content. On the VidChapters-7M benchmark, the method achieves an F1 score of 45.3, substantially outperforming the prior state of the art (26.7). The code and models are publicly released.
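A minimal sketch of the text-domain formulation described above. The exact prompt layout and output format used by Chapter-Llama are not given in this summary, so the interleaving scheme, the `[speech]`/`[caption]` tags, and the `HH:MM:SS - Title` output convention here are all hypothetical stand-ins:

```python
import re

def build_prompt(segments):
    """Interleave timestamped ASR lines and keyframe captions into one
    chronologically ordered text input for the LLM.
    segments: list of (timestamp_str, kind, text), kind in {"speech", "caption"}.
    Zero-padded HH:MM:SS strings sort correctly as plain strings."""
    lines = [f"{start} [{kind}] {text}" for start, kind, text in sorted(segments)]
    return "\n".join(lines)

def parse_chapters(llm_output):
    """Parse predicted lines of the (assumed) form 'HH:MM:SS - Chapter title'
    into (timestamp, title) pairs, ignoring anything that doesn't match."""
    chapters = []
    for line in llm_output.splitlines():
        m = re.match(r"^(\d{2}:\d{2}:\d{2})\s*-\s*(.+)$", line.strip())
        if m:
            chapters.append((m.group(1), m.group(2)))
    return chapters
```

Because both boundaries and titles come back as plain text, a single forward pass over the full transcript-plus-captions prompt yields the whole chapter list at once, which is what lets the approach scale to hour-long videos.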

📝 Abstract
We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our 'Chapter-Llama' framework. Specifically, we leverage a pretrained large language model (LLM) with large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Given the inefficiency of exhaustively captioning all frames, we propose a lightweight speech-guided frame selection strategy based on speech transcript content, and experimentally demonstrate remarkable advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing one-hour long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 45.3 vs 26.7 F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models at our project page.
Problem

Research questions and friction points this paper is trying to address.

Automatically partitioning long videos into semantic chapters
Generating descriptive titles for video chapters efficiently
Improving navigation and content retrieval in hour-long videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging LLM with large context window
Speech-guided frame selection strategy
Single forward pass for hour-long videos
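The speech-guided selection idea in the list above can be sketched as follows. The paper's actual selection criterion is not detailed in this summary; this hypothetical version simply caps captioning cost by sampling one frame per speech segment (at its midpoint) rather than captioning frames densely:

```python
def select_keyframes(speech_segments, fps=1.0):
    """Pick one frame index per speech segment instead of captioning every frame.
    speech_segments: list of (start_s, end_s) in seconds, from ASR timestamps.
    fps: sampling rate of the decoded frame stream (assumed, not from the paper)."""
    frames = []
    for start, end in speech_segments:
        mid = (start + end) / 2.0          # midpoint of the spoken segment
        frames.append(round(mid * fps))    # nearest frame index at that rate
    return sorted(set(frames))             # deduplicate overlapping picks
```

Tying frame selection to speech segments keeps the number of captions proportional to the amount of speech rather than to video length, which is one plausible reading of why the strategy improves both efficiency and alignment with the audio content.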