🤖 AI Summary
This work addresses the limitations of conventional video question answering methods that rely on uniform sampling of a fixed number of frames, which often fail to capture question-relevant visual content and thereby hinder answer accuracy. To overcome this, the authors propose a query-driven adaptive keyframe selection mechanism that, for the first time, integrates submodular mutual information (SMI) functions into video QA to dynamically select frames most semantically aligned with the input question. The method is seamlessly integrated in an end-to-end manner with vision-language models such as Video-LLaVA and LLaVA-NeXT. Experiments on the MVBench dataset demonstrate up to a 4% improvement in QA accuracy over uniform sampling, and qualitative analysis confirms that the selected frames better align with question semantics, significantly enhancing the model's contextual understanding.
📝 Abstract
Video Question Answering (VideoQA) models enhance understanding and interaction with audiovisual content, making it more accessible, searchable, and useful for a wide range of fields such as education, surveillance, entertainment, and content creation. Due to heavy compute requirements, most large visual language models (VLMs) for VideoQA rely on a fixed number of frames obtained by uniformly sampling the video. However, this process neither picks important frames nor captures the context of the video. We present a novel query-based selection of frames relevant to the question, based on submodular mutual information (SMI) functions. By replacing uniform frame sampling with query-based selection, our method ensures that the chosen frames provide complementary and essential visual information for accurate VideoQA. We evaluate our approach on the MVBench dataset, which spans a diverse set of multi-action video tasks. VideoQA accuracy on this dataset was assessed using two VLMs, namely Video-LLaVA and LLaVA-NeXT, both of which originally employed uniform frame sampling. Experiments were conducted using both uniform and query-based sampling strategies. An accuracy improvement of up to 4% was observed when using query-based frame selection over uniform sampling. Qualitative analysis further highlights that query-based selection, using SMI functions, consistently picks frames better aligned with the question. We opine that such query-based frame selection can enhance accuracy in a wide range of tasks that rely on only a subset of video frames.
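To make the idea of query-based selection concrete, the following is a minimal sketch of greedy frame selection under one SMI function. The abstract does not say which SMI function or which frame and question embeddings the authors use, so the facility-location-style objective, the trade-off parameter `eta`, the function names `flqmi_score` and `select_frames`, and the toy similarity values are all assumptions made for illustration. Similarities are assumed to be non-negative.

```python
from typing import List, Sequence


def flqmi_score(selected: Sequence[int],
                sim: Sequence[Sequence[float]],
                eta: float = 1.0) -> float:
    """Facility-location-style query mutual information (assumed form).

    sim[q][i] is the similarity between query item q and frame i.
    """
    if not selected:
        return 0.0
    # Query coverage: each query item is credited with its best selected frame.
    coverage = sum(max(row[i] for i in selected) for row in sim)
    # Query relevance: each selected frame is credited with its best query item.
    relevance = sum(max(row[i] for row in sim) for i in selected)
    return coverage + eta * relevance


def select_frames(sim: Sequence[Sequence[float]],
                  k: int,
                  eta: float = 1.0) -> List[int]:
    """Greedily pick up to k frame indices by marginal gain in flqmi_score."""
    num_frames = len(sim[0])
    selected: List[int] = []
    for _ in range(min(k, num_frames)):
        base = flqmi_score(selected, sim, eta)
        best_frame, best_gain = -1, float("-inf")
        for i in range(num_frames):
            if i in selected:
                continue
            gain = flqmi_score(selected + [i], sim, eta) - base
            if gain > best_gain:
                best_frame, best_gain = i, gain
        selected.append(best_frame)
    # Return indices in temporal order, as a VLM would expect them.
    return sorted(selected)


# Toy example: one question embedding compared against five frames.
toy_sim = [[0.1, 0.9, 0.2, 0.8, 0.05]]
chosen = select_frames(toy_sim, k=2)
```

In this toy case the two frames most similar to the question are chosen, whereas uniform sampling would pick frames by position alone. In practice the similarity matrix would come from a shared image-text embedding space, which is outside the scope of this sketch.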