CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion

📅 2024-08-21
🏛️ AAAI Conference on Artificial Intelligence
📈 Citations: 12
Influential: 0
🤖 AI Summary
Existing video saliency prediction methods primarily focus on perceptual modeling while neglecting language-guided reasoning, particularly explicit modeling of ranking cues for salient objects. To address this, the paper proposes CaRDiff (Caption, Rank, and generate with Diffusion), a framework built around a novel prompting method, Video Salient Object Ranking Chain-of-Thought (VSOR-CoT), that emulates the cognitive sequence of “observing → describing → ranking → generating.” For the first time, language-derived ranking maps are introduced as a conditional input to a diffusion model, integrated with a multimodal large language model, a grounding module, and chain-of-thought prompt engineering to enable end-to-end saliency map generation. The method achieves state-of-the-art performance on the MVS dataset and demonstrates zero-shot cross-dataset generalization on DHF1k, improving temporal consistency and fine-grained spatial localization and providing a foundation for language-augmented video saliency prediction.

📝 Abstract
Video saliency prediction aims to identify the regions in a video that attract human attention and gaze, driven by bottom-up features from the video and top-down processes like memory and cognition. Among these top-down influences, language plays a crucial role in guiding attention by shaping how visual information is interpreted. Existing methods primarily focus on modeling perceptual information while neglecting the reasoning process facilitated by language, where ranking cues are crucial outcomes of this process and practical guidance for saliency prediction. In this paper, we propose CaRDiff (Caption, Rank, and generate with Diffusion), a framework that imitates this process by integrating a multimodal large language model (MLLM), a grounding module, and a diffusion model to enhance video saliency prediction. Specifically, we introduce a novel prompting method, VSOR-CoT (Video Salient Object Ranking Chain of Thought), which utilizes an MLLM with a grounding module to caption video content and infer salient objects along with their rankings and positions. This process derives ranking maps that can be sufficiently leveraged by the diffusion model to accurately decode the saliency maps for the given video. Extensive experiments showcase the effectiveness of VSOR-CoT in improving the performance of video saliency prediction. The proposed CaRDiff outperforms state-of-the-art models on the MVS dataset and demonstrates cross-dataset capability on the DHF1k dataset through zero-shot evaluation.
Problem

Research questions and friction points this paper is trying to address.

Modeling language-guided reasoning for video saliency prediction
Integrating ranking cues from multimodal language models
Improving saliency map accuracy using diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MLLM with grounding for captioning and ranking
Generates ranking maps to guide diffusion model
VSOR-CoT prompting improves video saliency prediction
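The ranking-map idea above can be illustrated with a minimal sketch: given salient objects with ranks and bounding boxes (as VSOR-CoT would infer), rasterize each box at an intensity proportional to its saliency rank, yielding a spatial prior that a diffusion model could take as conditioning. The function name, box format, and the linear rank-to-intensity mapping are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def render_ranking_map(objects, height, width):
    """Render a hypothetical ranking map from ranked salient objects.

    `objects` is a list of (rank, (x1, y1, x2, y2)) tuples, where rank 1
    is the most salient object and boxes are in pixel coordinates.
    Each box is filled with an intensity that decreases with rank;
    overlaps keep the higher (more salient) intensity.
    """
    ranking_map = np.zeros((height, width), dtype=np.float32)
    n = len(objects)
    for rank, (x1, y1, x2, y2) in objects:
        # Rank 1 maps to intensity 1.0; the lowest rank maps to 1/n.
        intensity = (n - rank + 1) / n
        region = ranking_map[y1:y2, x1:x2]
        ranking_map[y1:y2, x1:x2] = np.maximum(region, intensity)
    return ranking_map

# Example: two ranked objects with overlapping boxes on a 10x10 frame.
rmap = render_ranking_map([(1, (0, 0, 4, 4)), (2, (2, 2, 8, 8))], 10, 10)
```

In the overlap region the rank-1 intensity (1.0) wins over the rank-2 intensity (0.5), so the map preserves the relative ordering the language model inferred.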