ReaSon: Reinforced Causal Search with Information Bottleneck for Video Understanding

📅 2025-11-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of keyframe selection under input token constraints in vision-language models (VLMs) and temporal sparsity across video frames, this paper proposes a causality-driven reinforcement search framework. Methodologically, it introduces a learnable policy network that jointly models visual relevance, counterfactual causal intervention scores, and a composite reward grounded in the *causal information bottleneck* principle—enabling dynamic, context-aware keyframe selection via reinforcement learning. Crucially, this work is the first to formalize keyframes as frames satisfying both *predictive sufficiency* and *causal necessity*. Evaluated on NExT-QA, EgoSchema, and Video-MME benchmarks, the approach consistently outperforms state-of-the-art methods, especially under extreme token budgets (e.g., 4–8 frames), demonstrating superior accuracy and generalization. The core contributions include: (1) a novel causal formalization of keyframes; (2) integration of causal information bottleneck into reward design; and (3) an end-to-end trainable reinforcement framework for temporally grounded video understanding.
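The selection loop described above (sample keyframes from a learnable policy, score them, update with a composite reward) can be sketched roughly as follows. All names, the additive reward form, and the independent-categorical gradient approximation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_keyframes(logits, k):
    """Sample k distinct frame indices from a softmax policy over per-frame logits."""
    probs = softmax(logits)
    idx = rng.choice(len(logits), size=k, replace=False, p=probs)
    return idx, probs

def cib_reward(sufficiency, necessity, beta=0.5):
    """Toy CIB-style composite reward: predictive sufficiency plus a weighted
    causal-necessity term. The additive form and beta are assumptions."""
    return sufficiency + beta * necessity

def reinforce_update(logits, idx, probs, reward, lr=0.1):
    """One REINFORCE step: move logits along reward * grad(log-prob), treating
    the k draws as independent categorical samples (an approximation)."""
    grad = -probs.copy()
    grad[idx] += 1.0  # one-hot minus probs for each sampled frame
    return logits + lr * reward * grad
```

With a positive composite reward, the update raises the logits of the sampled frames relative to the rest, which is the basic mechanism by which the policy learns to prefer sufficient-and-necessary frames.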

📝 Abstract
Keyframe selection has become essential for video understanding with vision-language models (VLMs) due to limited input tokens and the temporal sparsity of relevant information across video frames. Video understanding often relies on effective keyframes that are not only informative but also causally decisive. To this end, we propose Reinforced Causal Search with Information Bottleneck (ReaSon), a framework that formulates keyframe selection as an optimization problem with the help of a novel Causal Information Bottleneck (CIB), which explicitly defines keyframes as those satisfying both predictive sufficiency and causal necessity. Specifically, ReaSon employs a learnable policy network to select keyframes from a visually relevant pool of candidate frames to capture predictive sufficiency, and then assesses causal necessity via counterfactual interventions. Finally, a composite reward aligned with the CIB principle is designed to guide the selection policy through reinforcement learning. Extensive experiments on NExT-QA, EgoSchema, and Video-MME demonstrate that ReaSon consistently outperforms existing state-of-the-art methods under limited-frame settings, validating its effectiveness and generalization ability.
Problem

Research questions and friction points this paper is trying to address.

Selecting causally decisive keyframes for video understanding with VLMs
Optimizing keyframe selection using predictive sufficiency and causal necessity
Improving video understanding under limited-frame constraints through reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning optimizes keyframe selection policy
Causal Information Bottleneck ensures predictive and causal criteria
Counterfactual interventions assess causal necessity in video understanding
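The counterfactual-intervention idea in the last bullet can be illustrated with a toy masking loop: a frame is causally necessary to the extent that removing it drops the model's answer confidence. Here `answer_confidence` is a hypothetical stand-in for a VLM scorer, not the paper's API.

```python
def counterfactual_necessity(frames, answer_confidence):
    """For each frame, mask it out and measure the resulting drop in answer
    confidence (clipped at zero). A toy proxy for causal necessity."""
    base = answer_confidence(frames)
    scores = []
    for i in range(len(frames)):
        masked = frames[:i] + frames[i + 1:]
        scores.append(max(0.0, base - answer_confidence(masked)))
    return scores
```

Frames whose removal leaves the answer confidence unchanged get a necessity score of zero, so the reward only credits frames the prediction actually depends on.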
👥 Authors
Yuan Zhou — Nanjing University of Information Science and Technology
Litao Hua — Nanjing University of Information Science and Technology
Shilong Jin — Nanjing University of Information Science and Technology
Wentao Huang — Nanjing University of Information Science and Technology
Haoran Duan — Tsinghua/Newcastle/Durham University
Multimodal AI · Generative AI