HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

📅 2026-03-19
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing frame sampling strategies in video question answering often fail to adapt to downstream tasks, limiting both the efficiency and performance of vision-language models (VLMs). To address this, we propose HORNet, a lightweight, task-guided dynamic frame selection method that employs Group Relative Policy Optimization (GRPO) to train a compact policy network (<1M parameters) for selecting key frames for a frozen VLM. Built upon the newly introduced Select Any Frames (SAF) paradigm, HORNet decouples frame selection from inference, enabling cross-model transfer without retraining. Evaluated across six benchmarks, HORNet achieves a 93% inference speedup using only 1% of the original frames, improves F1 by 1.7% on MSVD-QA, outperforms uniform sampling by 7.3 points on NExT-QA, and yields an 8.5% relative gain when applied to stronger VLMs.

πŸ“ Abstract
Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet's policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing what a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at https://github.com/ostadabbas/HORNet.
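The core idea in the abstract (a small policy scores frames, groups of candidate frame subsets are rolled out through the frozen VLM, and rewards are normalized within each group rather than against a learned critic) can be sketched roughly as below. This is a minimal illustration of the GRPO-style update under stated assumptions, not the authors' implementation: the function names (group_relative_advantages, grpo_step), the independent-softmax sampling model, and the scalar-reward interface are all illustrative assumptions.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO baseline: normalize rewards within one sampled group
    (subtract the group mean, divide by the group std; no critic)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grpo_step(logits, groups, rewards, lr=0.1):
    """One policy-gradient step on per-frame selection logits.

    logits:  per-frame scores from a tiny policy network (one per frame)
    groups:  G rollouts, each a list of frame indices drawn from the
             softmax over logits (assumed independent draws here)
    rewards: G scalar rewards, e.g. answer correctness from the frozen VLM
    """
    probs = softmax(logits)
    advs = group_relative_advantages(rewards)
    grad = [0.0] * len(logits)
    for sel, adv in zip(groups, advs):
        for t in range(len(logits)):
            # d/d logit_t of sum_{f in sel} log p_f for softmax draws:
            # (times frame t was picked) - |sel| * p_t
            grad[t] += adv * (sel.count(t) - len(sel) * probs[t])
    return [x + lr * g / len(groups) for x, g in zip(logits, grad)]
```

In this sketch, rollouts whose frame subsets earn above-group-average reward have their selection probabilities pushed up, which is why no value network is needed: the group itself serves as the baseline.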
Problem

Research questions and friction points this paper is trying to address.

video question answering
frame selection
vision-language models
temporal reasoning
input efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

frame selection
vision-language models
video question answering
policy optimization
input efficiency