🤖 AI Summary
In zero-shot video captioning, the absence of paired video–text annotations often leads to incomplete and inaccurate descriptions. To address this, we propose a progressive multi-granularity prompting framework that operates without paired supervision. Our method constructs a three-level structured memory bank of noun phrases, scene graphs, and full sentences, and introduces a category-aware retrieval mechanism that models the natural language distribution around scene topics. This mechanism leverages CLIP-based vision–language alignment and scene graph parsing to enable fine-grained content recall. Evaluated on the MSR-VTT, MSVD, and VATEX benchmarks, our approach achieves absolute CIDEr improvements of 5.7%, 16.2%, and 3.4%, respectively, significantly outperforming state-of-the-art zero-shot methods. These results validate the effectiveness of explicitly modeling scene content distributions for improving both descriptive completeness and accuracy.
📝 Abstract
Zero-shot video captioning requires a model to generate high-quality captions without human-annotated video–text pairs for training. State-of-the-art approaches leverage CLIP to extract visually relevant textual prompts that guide language models in generating captions. However, these methods tend to focus on a single key aspect of the scene, producing captions that ignore the rest of the visual input. To address this issue and generate more accurate and complete captions, we propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks: noun phrases, scene graphs over noun phrases, and entire sentences. Moreover, we introduce a category-aware retrieval mechanism that models the distribution of natural language surrounding the specific topics in question. Extensive experiments demonstrate the effectiveness of our method, with improvements of 5.7%, 16.2%, and 3.4% in the main metric, CIDEr, over the existing state of the art on the MSR-VTT, MSVD, and VATEX benchmarks.
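The multi-granularity retrieval idea can be sketched roughly as follows. Everything here is an illustrative stand-in: the bank contents, the toy bag-of-words "embedding", and the function names are placeholders, since the actual method retrieves prompts by similarity in CLIP's vision–language embedding space and builds scene graphs with a parser.

```python
import numpy as np

# Toy memory banks at three granularities (placeholder contents,
# not from the paper's actual banks).
BANKS = {
    "noun_phrases": ["a brown dog", "a green frisbee", "a sunny park"],
    "scene_graphs": ["dog chasing frisbee", "dog in park"],
    "sentences": ["a dog catches a frisbee in the park",
                  "a man rides a bicycle down the street"],
}

VOCAB = sorted({w for items in BANKS.values() for s in items for w in s.split()})

def embed(text):
    """Toy bag-of-words vector standing in for a CLIP text/video embedding."""
    v = np.zeros(len(VOCAB))
    for w in text.split():
        if w in VOCAB:
            v[VOCAB.index(w)] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query, k=1):
    """Return the top-k entries from each bank by cosine similarity,
    giving the language model prompts at every granularity."""
    q = embed(query)
    prompts = {}
    for name, items in BANKS.items():
        sims = [float(embed(s) @ q) for s in items]
        top = sorted(range(len(items)), key=lambda i: -sims[i])[:k]
        prompts[name] = [items[i] for i in top]
    return prompts

# A "video" is represented here by the concepts it contains.
prompts = retrieve("dog frisbee park")
print(prompts)
```

The concatenated prompts from the three banks would then condition a language model, so the generated caption covers objects, relations, and sentence-level context rather than a single salient aspect.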