DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding

📅 2024-11-19
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Long-form video understanding faces a fundamental trade-off between visual-semantic fidelity and token budget constraints. To address this, we propose a dynamic collaborative encoding framework that performs question-driven frame-level importance modeling for adaptive key-frame selection and differential encoding. Our contributions are twofold: (1) a novel Dynamic Event Prototype Estimation module, which statistically models event evolution to guide key-frame sampling; and (2) a compact Collaborative Encoding module integrating hierarchical visual encoders (CLIP + ViT), question-conditioned attention, and a lightweight cross-frame collaboration network to construct hybrid fine-grained/coarse-grained representations. Evaluated on five mainstream video QA benchmarks, our method achieves state-of-the-art or near-state-of-the-art accuracy while reducing token consumption by 37% and accelerating inference by 2.1×.

Technology Category

Application Category

📝 Abstract
The challenge in LLM-based video understanding lies in preserving visual and semantic information in long videos while maintaining a memory-affordable token count. However, redundancy and correspondence in videos have hindered the performance potential of existing methods. Through statistical learning on current datasets, we observe that redundancy occurs in both repeated and answer-irrelevant frames, and the corresponding frames vary with different questions. This suggests the possibility of adopting dynamic encoding to balance detailed video information preservation with token budget reduction. To this end, we propose a dynamic cooperative network, DynFocus, for memory-efficient video encoding in this paper. Specifically, i) a Dynamic Event Prototype Estimation (DPE) module to dynamically select meaningful frames for question answering; (ii) a Compact Cooperative Encoding (CCE) module that encodes meaningful frames with detailed visual appearance and the remaining frames with sketchy perception separately. We evaluate our method on five publicly available benchmarks, and experimental results consistently demonstrate that our method achieves competitive performance.
Problem

Research questions and friction points this paper is trying to address.

Balancing visual and semantic preservation in long videos
Reducing redundancy and irrelevant frames in video encoding
Dynamic frame selection for efficient video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Event Prototype Estimation for frame selection
Compact Cooperative Encoding for efficient video encoding
Balances detail preservation with token reduction
🔎 Similar Papers
No similar papers found.
Y
Yudong Han
Beijing Institute of Technology
Qingpei Guo
Qingpei Guo
Ant Group
Multimodal LLMsVision-Language Models
Liyuan Pan
Liyuan Pan
Beijing Institute of Technology
Computer vision
L
Liu Liu
KooMap Dept., Huawei
Y
Yu Guan
University of Warwick
M
Ming Yang
Ant Group