Adaptive Greedy Frame Selection for Long Video Understanding

📅 2026-03-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses inference inefficiency and missed critical frames in long-video question answering. The authors propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Candidate frames are embedded in two complementary spaces, SigLIP for question relevance and DINOv2 for semantic similarity, and a facility-location coverage term promotes diversity. The objective is normalized, monotone, and submodular, so greedy selection carries the standard (1−1/e) approximation guarantee. A lightweight text-only question-type classifier routes each query to one of four preset strategies that trade off relevance against coverage. On the MLVU dataset, the method yields consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.

📝 Abstract

Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1−1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
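
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `greedy_select`, the weight `lam`, and the exact form of the objective, F(S) = lam · Σ_{i∈S} relevance[i] + (1 − lam) · Σ_j max_{i∈S} similarity[j, i], are assumptions made here for illustration, since the abstract does not state the weighting or normalization used. The sketch also assumes non-negative relevance scores and pairwise similarities, which keeps this objective monotone, and it uses small hand-written arrays in place of real SigLIP and DINOv2 embeddings.

```python
import numpy as np


def greedy_select(relevance, similarity, budget, lam=0.5):
    """Greedily pick `budget` frame indices (hypothetical helper).

    relevance:  shape (n,), non-negative question-relevance score per frame.
    similarity: shape (n, n), non-negative frame-to-frame similarity.
    lam:        assumed weight between relevance and coverage.
    """
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    n = relevance.shape[0]
    budget = min(budget, n)

    selected = []
    available = np.ones(n, dtype=bool)
    # best_cover[j] = highest similarity between frame j and any selected frame
    best_cover = np.zeros(n)

    for _ in range(budget):
        # Facility-location marginal gain of adding frame i:
        # sum over j of the improvement in how well j is covered.
        cover_gain = np.maximum(similarity - best_cover[:, None], 0.0).sum(axis=0)
        gain = lam * relevance + (1.0 - lam) * cover_gain
        gain[~available] = -np.inf  # never pick the same frame twice
        i = int(np.argmax(gain))
        selected.append(i)
        available[i] = False
        best_cover = np.maximum(best_cover, similarity[:, i])

    return sorted(selected)  # return in temporal order


# Toy data: frames 0 and 1 are near-duplicates, as are frames 2 and 3.
toy_relevance = [0.9, 0.85, 0.1, 0.2]
toy_similarity = [
    [1.0, 0.95, 0.0, 0.0],
    [0.95, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]

balanced = greedy_select(toy_relevance, toy_similarity, budget=2, lam=0.5)
relevance_only = greedy_select(toy_relevance, toy_similarity, budget=2, lam=1.0)
print(balanced)        # [0, 3]
print(relevance_only)  # [0, 1]
```

On this toy input, pure relevance (`lam=1.0`) picks the two near-duplicate frames, whereas the mixed objective spends its second pick on the uncovered cluster. This mirrors the collapse-versus-coverage trade-off the abstract describes; a question-type router could then be thought of as choosing among a few preset values of such a weight, although the abstract does not specify what the four presets contain.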

Problem

Research questions and friction points this paper is trying to address.

long video understanding

frame selection

vision-language models

temporal coverage

query relevance

Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive frame selection

submodular optimization

long-video understanding