Event-Anchored Frame Selection for Effective Long-Video Understanding

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long video understanding is hindered by frame redundancy and the limited context window of vision-language models, necessitating an efficient frame selection mechanism. This work proposes a hierarchical, event-aware, plug-and-play frame selection method: it first partitions the video into semantic event segments using self-supervised DINO embeddings, then selects query-relevant anchor frames within each segment, and finally applies adaptive Maximal Marginal Relevance (MMR) for global optimization to balance event coverage, query relevance, and visual diversity. By introducing the novel concept of event anchoring, this approach breaks away from conventional flat sampling paradigms. When integrated into LLaVA-Video-7B, it achieves accuracy improvements of 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.

📝 Abstract
Massive frame redundancy and limited context window make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm which treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
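The three-stage pipeline described in the abstract (event segmentation, per-event anchor selection, MMR-based global refinement) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the cosine-similarity segmentation rule, the similarity threshold, and the trade-off weight `lam` are assumptions, and the embeddings stand in for the DINO frame features and query features the paper uses.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def segment_frames(embs, threshold=0.85):
    """Partition frames into visually homogeneous segments (event proxies):
    start a new segment whenever similarity to the previous frame drops."""
    segments, current = [], [0]
    for i in range(1, len(embs)):
        if cosine(embs[i], embs[i - 1]) < threshold:
            segments.append(current)
            current = []
        current.append(i)
    segments.append(current)
    return segments

def select_frames(embs, query_emb, k=8, lam=0.7, threshold=0.85):
    """Event-anchored selection sketch: one query-relevant anchor per event,
    then a greedy Maximal Marginal Relevance pass fills the frame budget."""
    segments = segment_frames(embs, threshold)
    relevance = np.array([cosine(e, query_emb) for e in embs])
    # Anchor = most query-relevant frame within each event segment.
    anchors = [max(seg, key=lambda i: relevance[i]) for seg in segments]
    selected = list(dict.fromkeys(anchors))[:k]
    # MMR refinement: balance query relevance against redundancy with
    # already-selected frames, weighted by lam.
    candidates = [i for i in range(len(embs)) if i not in selected]
    while len(selected) < k and candidates:
        def mmr_score(i):
            redundancy = max(cosine(embs[i], embs[j]) for j in selected)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)
```

On synthetic embeddings with two visually distinct clusters, the anchor stage guarantees at least one frame per event before MMR spends the remaining budget, which is the coverage/relevance/diversity balance the abstract describes.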
Problem

Research questions and friction points this paper is trying to address.

long-video understanding
frame selection
large vision-language models
frame redundancy
context window
Innovation

Methods, ideas, or system contributions that make the work stand out.

Event-Anchored Frame Selection
Long-Video Understanding
Self-Supervised Embeddings
Maximal Marginal Relevance
Large Vision-Language Models
Wang Chen
Individual Researcher
Natural Language Processing · Text Generation · Information Extraction
Yongdong Luo
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Yuhui Zeng
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Luojun Lin
College of Computer and Data Science, Fuzhou University
Tianyu Xie
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Fei Chao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Xiawu Zheng
Associate Professor, IEEE Senior Member, Xiamen University
Automated Machine Learning · Network Compression · Neural Architecture Search · AutoML