🤖 AI Summary
Existing multimodal large language models (M-LLMs) for long-video understanding rely on uniform frame sampling, which often discards query-relevant key frames and degrades question-answering performance. To address this, we propose a lightweight, plug-and-play query-adaptive frame selection method that requires no fine-tuning of downstream M-LLMs. Our core innovation is a novel dual-signal supervision framework: a spatial signal, where the M-LLM scores individual frames to assess their static relevance; and a temporal signal, where an LLM models inter-frame temporal dependencies based on frame captions, enabling subtitle-guided dynamic selection. Evaluated on mid-to-long video QA benchmarks—including ActivityNet, NExT-QA, EgoSchema, and LongVideoBench—our method consistently improves reasoning accuracy across diverse video-LLM architectures. Results demonstrate that adaptive frame selection is critical for effective long-video comprehension.
📝 Abstract
Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular M-LLM frameworks usually apply naive uniform sampling to reduce the number of video frames fed into the model, particularly for long-context videos. However, uniform sampling can discard crucial context from certain periods of a video, leaving the downstream M-LLM without sufficient visual information to answer a question. To address this problem, we propose a lightweight, M-LLM-based frame selection method that adaptively selects frames more relevant to a user's query. To train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, in which single-frame importance is scored by prompting an M-LLM; and (ii) a temporal signal, in which multi-frame selection is performed by prompting a Large Language Model (LLM) with the captions of all candidate frames. The selected frames are then consumed by a frozen downstream video M-LLM for visual reasoning and question answering. Empirical results show that the proposed M-LLM video frame selector improves the performance of various downstream video Large Language Models (video-LLMs) on medium-length (ActivityNet, NExT-QA) and long (EgoSchema, LongVideoBench) video question-answering benchmarks.
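The pipeline described above can be sketched at a high level: score each frame for static relevance, score the caption sequence for temporal relevance, blend the two, and pass the top frames (in time order) to a frozen video-LLM. The following is a minimal illustration only; the function names, the linear score-blending scheme, and the `top_k` cutoff are assumptions for exposition, not the paper's actual training or selection procedure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Frame:
    index: int    # position in the uniformly pre-sampled candidate pool
    caption: str  # caption used for the temporal (LLM-side) signal


def select_frames(
    frames: Sequence[Frame],
    query: str,
    spatial_score: Callable[[Frame, str], float],
    temporal_score: Callable[[Sequence[str], str], List[float]],
    top_k: int = 8,
    alpha: float = 0.5,
) -> List[Frame]:
    """Hypothetical query-adaptive selector: blend a per-frame (spatial)
    relevance score with a caption-sequence (temporal) relevance score,
    keep the top_k frames, and restore their temporal order before
    handing them to a frozen downstream video-LLM."""
    spatial = [spatial_score(f, query) for f in frames]
    temporal = temporal_score([f.caption for f in frames], query)
    combined = [alpha * s + (1.0 - alpha) * t for s, t in zip(spatial, temporal)]
    top = sorted(range(len(frames)), key=lambda i: combined[i], reverse=True)[:top_k]
    return [frames[i] for i in sorted(top)]  # time order, not score order
```

In a real system, `spatial_score` would be an M-LLM prompted with a single frame and the query, and `temporal_score` an LLM prompted with the full caption list; here they are left as injected callables so the selection logic itself stays self-contained.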