Event-Anchored Frame Selection for Effective Long-Video Understanding

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long video understanding is hindered by frame redundancy and the limited context window of vision-language models, necessitating an efficient frame selection mechanism. This work proposes a hierarchical, event-aware, plug-and-play frame selection method: it first partitions the video into semantic event segments using self-supervised DINO embeddings, then selects query-relevant anchor frames within each segment, and finally applies adaptive Maximal Marginal Relevance (MMR) for global optimization to balance event coverage, query relevance, and visual diversity. By introducing the novel concept of event anchoring, this approach breaks away from conventional flat sampling paradigms. When integrated into LLaVA-Video-7B, it achieves accuracy improvements of 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.

📝 Abstract
Massive frame redundancy and limited context window make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm which treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
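The three-stage pipeline described in the abstract (event segmentation, per-event anchor selection, MMR-based global refinement) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the cosine-similarity segmentation rule, the similarity threshold, and the trade-off weight `lam` are assumptions, and the embeddings stand in for the DINO frame features and query features the paper uses.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def segment_frames(embs, threshold=0.85):
    """Partition frames into visually homogeneous segments (event proxies):
    start a new segment whenever similarity to the previous frame drops."""
    segments, current = [], [0]
    for i in range(1, len(embs)):
        if cosine(embs[i], embs[i - 1]) < threshold:
            segments.append(current)
            current = []
        current.append(i)
    segments.append(current)
    return segments

def select_frames(embs, query_emb, k=8, lam=0.7, threshold=0.85):
    """Event-anchored selection sketch: one query-relevant anchor per event,
    then a greedy Maximal Marginal Relevance pass fills the frame budget."""
    segments = segment_frames(embs, threshold)
    relevance = np.array([cosine(e, query_emb) for e in embs])
    # Anchor = most query-relevant frame within each event segment.
    anchors = [max(seg, key=lambda i: relevance[i]) for seg in segments]
    selected = list(dict.fromkeys(anchors))[:k]
    # MMR refinement: balance query relevance against redundancy with
    # already-selected frames, weighted by lam.
    candidates = [i for i in range(len(embs)) if i not in selected]
    while len(selected) < k and candidates:
        def mmr_score(i):
            redundancy = max(cosine(embs[i], embs[j]) for j in selected)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)
```

On synthetic embeddings with two visually distinct clusters, the anchor stage guarantees at least one frame per event before MMR spends the remaining budget, which is the coverage/relevance/diversity balance the abstract describes.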
Problem

Research questions and friction points this paper is trying to address.

long-video understanding
frame selection
large vision-language models
frame redundancy
context window
Innovation

Methods, ideas, or system contributions that make the work stand out.

Event-Anchored Frame Selection
Long-Video Understanding
Self-Supervised Embeddings
Maximal Marginal Relevance
Large Vision-Language Models
Wang Chen
Individual Researcher
Natural Language Processing · Text Generation · Information Extraction
Yongdong Luo
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Yuhui Zeng
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Luojun Lin
College of Computer and Data Science, Fuzhou University
Tianyu Xie
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Fei Chao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Xiawu Zheng
Associate Professor, IEEE Senior Member, Xiamen University
Automated Machine Learning · Network Compression · Neural Architecture Search · AutoML