AI Summary
Existing video captioning approaches struggle to balance visual fidelity and redundancy control: holistic descriptions are overly concise, while segment-level captions suffer from severe redundancy. Inspired by video codecs, this work proposes a keyframe-residual collaborative modeling framework, where keyframe captions encode stable visual context and residual captions capture only local dynamic changes. We introduce, for the first time, video codec principles into dense video captioning by establishing a dual-path keyframe-residual representation mechanism. To support this paradigm, we release CodecVDC-100K, a large-scale dataset, along with VidCapQA, a question-answering-driven evaluation benchmark. Experiments demonstrate that our method significantly outperforms strong vision-language model baselines, improving caption fidelity while substantially reducing redundancy.
Abstract
Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose CodecCap, a codec-inspired framework for high-fidelity dense video captioning. Analogous to video codecs, CodecCap represents videos using keyframe and residual captions. Keyframe captions exhaustively encode stable visual context, while residual captions capture only temporally localized actions, motions, and changes. This preserves fine-grained visual evidence while reducing redundant descriptions. To quantify caption fidelity, we introduce VidCapQA, a caption-then-QA benchmark with 1,000 questions across 14 capability dimensions. Results on VidCapQA show that captions generated directly by strong VLMs still miss many visual details, highlighting caption representation as a critical bottleneck. Experiments show that CodecCap significantly surpasses direct captioning with the same underlying VLMs, suggesting that keyframe-residual captioning is a viable route to high-fidelity video-language supervision. We further use CodecCap to construct CodecVDC-100K, a large-scale dense captioning dataset with anchor, residual, scene-level, and video-level supervision.
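To make the keyframe-residual idea concrete, the following is a minimal Python sketch of how such captions might be stored and flattened into one dense caption. It is an illustration only, not the authors' implementation: the class names, field names, and the `render_dense_caption` helper are all hypothetical, and the abstract does not specify any data format.

```python
from dataclasses import dataclass, field


@dataclass
class ResidualCaption:
    """Hypothetical record: only what changes in a time span (actions, motion)."""
    start_s: float
    end_s: float
    text: str


@dataclass
class KeyframeSegment:
    """Hypothetical record: one keyframe caption plus its residual captions."""
    keyframe_time_s: float
    keyframe_caption: str  # stable context, written once (scene, objects, layout)
    residuals: list[ResidualCaption] = field(default_factory=list)


def render_dense_caption(segments: list[KeyframeSegment]) -> str:
    """Flatten segments: stable context once, then changes in time order."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.keyframe_time_s):
        lines.append(f"[{seg.keyframe_time_s:.1f}s] {seg.keyframe_caption}")
        for res in sorted(seg.residuals, key=lambda r: r.start_s):
            lines.append(f"  [{res.start_s:.1f}-{res.end_s:.1f}s] {res.text}")
    return "\n".join(lines)


# Example with made-up captions: the kitchen is described once,
# and each residual adds only the local change.
segment = KeyframeSegment(
    keyframe_time_s=0.0,
    keyframe_caption="A kitchen with a wooden table and a white cup.",
    residuals=[
        ResidualCaption(2.0, 3.5, "A hand picks up the cup."),
        ResidualCaption(0.5, 2.0, "A person enters from the left."),
    ],
)
print(render_dense_caption([segment]))
```

Under this assumed layout, the stable scene description appears once per keyframe rather than once per segment, which is the redundancy reduction the abstract describes.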