🤖 AI Summary
This work addresses the lack of fine-grained temporal modeling in video captioning by proposing a novel frame-level, progress-aware captioning task: models must not only accurately describe individual frames but also explicitly capture the incremental evolution of actions across consecutive frames. To support this task, the authors introduce FrameCap, the first dedicated dataset, and FrameCapEval, a corresponding benchmark for quantitative evaluation. They further propose ProgressCaptioner, a unified architecture integrating inter-frame difference modeling, temporal attention mechanisms, and progressive prompt learning. Extensive experiments show that ProgressCaptioner significantly outperforms state-of-the-art captioning methods on FrameCapEval, generating captions that faithfully reflect action progression. The work bridges a critical gap in fine-grained temporal language generation for videos and establishes a new paradigm, along with practical tools, for keyframe selection and deeper video understanding.
📝 Abstract
While image captioning provides isolated descriptions of individual images, and video captioning offers a single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level. This novel task aims to generate temporally fine-grained captions that not only accurately describe each frame but also capture the subtle progression of actions throughout a video sequence. Despite the strong capabilities of leading vision-language models, they often struggle to discern the nuances of frame-wise differences. To address this, we propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. Alongside it, we develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. The results demonstrate that ProgressCaptioner significantly surpasses leading captioning models, producing precise captions that accurately capture action progression and set a new standard for temporal precision in video captioning. Finally, we showcase practical applications of our approach, specifically in aiding keyframe selection and advancing video understanding, highlighting its broad utility.