🤖 AI Summary
This work proposes the Omni Dense Captioning task, which aims to generate temporally precise, fine-grained, and structured audio-visual descriptions across multiple scenes to support screenplay-like continuous narratives. To this end, we introduce the first six-dimensional structured captioning framework, construct a high-quality benchmark named OmniDCBench, and develop a unified evaluation metric, SodaM. Leveraging the TimeChatCap-42K dataset, we train the TimeChat-Captioner-7B model using a hybrid strategy of supervised fine-tuning (SFT) and group relative policy optimization (GRPO) augmented with task-specific rewards. The resulting model outperforms Gemini-2.5-Pro in dense caption generation and significantly enhances performance on downstream audio-visual reasoning and temporal localization tasks.
📝 Abstract
This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code will be made publicly available at https://github.com/yaolinli/TimeChat-Captioner.