CodecCap: High-Fidelity Codec-Inspired Residual Modeling for Dense Video Captioning

📅 2026-05-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing video captioning approaches struggle to balance visual fidelity and redundancy control: holistic descriptions are overly concise, while segment-level captions suffer from severe redundancy. Inspired by video codecs, this work proposes a keyframe-residual collaborative modeling framework in which keyframe captions encode stable visual context and residual captions capture only local dynamic changes. The work introduces video codec principles into dense video captioning by establishing a dual-path keyframe-residual representation mechanism. To support this paradigm, the authors construct CodecVDC-100K, a large-scale dataset, along with VidCapQA, a question-answering–driven evaluation benchmark. Experiments show that the method significantly outperforms direct captioning with strong vision-language model baselines, improving caption fidelity while reducing redundancy.
📝 Abstract
Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose CodecCap, a codec-inspired framework for high-fidelity dense video captioning. Analogous to video codecs, CodecCap represents videos using keyframe and residual captions. Keyframe captions exhaustively encode stable visual context, while residual captions capture only temporally localized actions, motions, and changes. This preserves fine-grained visual evidence while reducing redundant descriptions. To quantify the fidelity of captions, we introduce VidCapQA, a caption-then-QA benchmark with 1,000 questions across 14 capability dimensions. Results on VidCapQA show that captions directly generated by strong VLMs still miss many visual details, highlighting caption representation as a critical bottleneck. Experiments show that CodecCap significantly surpasses direct captioning with the same underlying VLMs, suggesting that keyframe-residual captioning is a viable route to high-fidelity video-language supervision. We further use CodecCap to construct CodecVDC-100K, a large-scale dense captioning dataset with anchor, residual, scene-level, and video-level supervision.
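To make the keyframe-residual idea concrete, here is a minimal illustrative sketch in Python. It is not the authors' implementation: the class names (`KeyframeSegment`, `ResidualCaption`), the helper functions, and the example captions are all hypothetical, chosen only to show how stating the stable context once per keyframe and describing only changes afterward avoids the repetition of segment-wise captioning.

```python
from dataclasses import dataclass, field


@dataclass
class ResidualCaption:
    """Hypothetical record: only what changed in a time span."""
    start_s: float
    end_s: float
    text: str


@dataclass
class KeyframeSegment:
    """Hypothetical record: one keyframe caption plus its residual captions."""
    keyframe_s: float
    keyframe_caption: str
    residuals: list[ResidualCaption] = field(default_factory=list)


def render(segments: list[KeyframeSegment]) -> str:
    """Flatten segments, stating the stable context once per keyframe."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg.keyframe_s:.1f}s] {seg.keyframe_caption}")
        for r in seg.residuals:
            lines.append(f"  [{r.start_s:.1f}-{r.end_s:.1f}s] {r.text}")
    return "\n".join(lines)


def render_segmentwise(segments: list[KeyframeSegment]) -> str:
    """Comparison baseline: repeat the full context in every segment caption."""
    lines = []
    for seg in segments:
        for r in seg.residuals:
            lines.append(
                f"[{r.start_s:.1f}-{r.end_s:.1f}s] {seg.keyframe_caption} {r.text}"
            )
    return "\n".join(lines)


# Made-up example content, for illustration only.
segments = [
    KeyframeSegment(
        keyframe_s=0.0,
        keyframe_caption=(
            "A chef in a white apron stands at a steel counter with a cutting board."
        ),
        residuals=[
            ResidualCaption(0.0, 2.0, "He picks up a knife."),
            ResidualCaption(2.0, 5.0, "He slices an onion."),
            ResidualCaption(5.0, 7.0, "He sweeps the slices into a pan."),
        ],
    )
]

codec_style = render(segments)
segment_style = render_segmentwise(segments)
print(codec_style)
print(len(codec_style.split()), "words vs", len(segment_style.split()), "words")
```

In this toy case the scene description appears once in the keyframe-residual rendering and three times in the segment-wise rendering, which is the redundancy the abstract refers to. How CodecCap actually selects keyframes and generates residual captions is not specified here.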
Problem

Research questions and friction points this paper is trying to address.

video captioning
visual fidelity
redundancy
dense captioning
fine-grained evidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

dense video captioning
codec-inspired modeling
keyframe-residual representation
high-fidelity captioning
video-language supervision
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30