🤖 AI Summary
In zero-shot video captioning, the absence of paired video–text annotations often leads to incomplete and inaccurate descriptions. To address this, we propose a progressive multi-granularity prompting framework that operates without paired supervision. Our method constructs a three-level structured memory bank of noun phrases, scene graphs, and full sentences, and introduces a category-aware retrieval mechanism that models the natural language distribution around scene topics. This mechanism leverages CLIP-based vision–language alignment and scene graph parsing to enable fine-grained content recall. Evaluated on the MSR-VTT, MSVD, and VATEX benchmarks, our approach achieves absolute CIDEr improvements of 5.7%, 16.2%, and 3.4%, respectively, significantly outperforming state-of-the-art zero-shot methods. These results validate the effectiveness of explicitly modeling scene content distributions for improving both descriptive completeness and accuracy.
📝 Abstract
Zero-shot video captioning requires a model to generate high-quality captions without human-annotated video–text pairs for training. State-of-the-art approaches leverage CLIP to extract visually relevant textual prompts that guide language models in generating captions. However, these methods tend to focus on a single key aspect of the scene, producing captions that ignore the rest of the visual input. To address this issue and generate more accurate and complete captions, we propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks: noun phrases, scene graphs over noun phrases, and entire sentences. Moreover, we introduce a category-aware retrieval mechanism that models the distribution of natural language surrounding the specific topics in question. Extensive experiments demonstrate the effectiveness of our method, with improvements of 5.7%, 16.2%, and 3.4% in the main metric, CIDEr, over the existing state of the art on the MSR-VTT, MSVD, and VATEX benchmarks.
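The multi-granularity retrieval idea can be sketched roughly as follows. Everything here is an illustrative stand-in: the bank contents, the toy bag-of-words "embedding", and the function names are placeholders, since the actual method retrieves prompts by similarity in CLIP's vision–language embedding space and builds scene graphs with a parser.

```python
import numpy as np

# Toy memory banks at three granularities (placeholder contents,
# not from the paper's actual banks).
BANKS = {
    "noun_phrases": ["a brown dog", "a green frisbee", "a sunny park"],
    "scene_graphs": ["dog chasing frisbee", "dog in park"],
    "sentences": ["a dog catches a frisbee in the park",
                  "a man rides a bicycle down the street"],
}

VOCAB = sorted({w for items in BANKS.values() for s in items for w in s.split()})

def embed(text):
    """Toy bag-of-words vector standing in for a CLIP text/video embedding."""
    v = np.zeros(len(VOCAB))
    for w in text.split():
        if w in VOCAB:
            v[VOCAB.index(w)] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query, k=1):
    """Return the top-k entries from each bank by cosine similarity,
    giving the language model prompts at every granularity."""
    q = embed(query)
    prompts = {}
    for name, items in BANKS.items():
        sims = [float(embed(s) @ q) for s in items]
        top = sorted(range(len(items)), key=lambda i: -sims[i])[:k]
        prompts[name] = [items[i] for i in top]
    return prompts

# A "video" is represented here by the concepts it contains.
prompts = retrieve("dog frisbee park")
print(prompts)
```

The concatenated prompts from the three banks would then condition a language model, so the generated caption covers objects, relations, and sentence-level context rather than a single salient aspect.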