🤖 AI Summary
To address the excessive token consumption, the performance degradation ("less is more" phenomenon), and the redundant keyframe selection ("visual echoes") that afflict multimodal large language models (MLLMs) in video question answering (Video-QA), this work proposes an adaptive frame-pruning and semantic-graph fusion framework. Methodologically: (1) hierarchical-clustering-based pruning is performed in the joint ResNet-50 and CLIP feature space to eliminate temporal redundancy; (2) a lightweight, text-driven semantic graph is introduced to recover, at minimal computational cost, the contextual information lost to frame pruning. Extensive experiments across multiple state-of-the-art MLLMs and Video-QA benchmarks show that the approach achieves up to 86.9% frame compression and 83.2% token reduction while matching or even surpassing the accuracy of full-frame baselines, significantly improving the efficiency–accuracy trade-off.
📝 Abstract
The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While increasing the number of sampled frames is a common strategy, we observe a "less is more" phenomenon where excessive frames can paradoxically degrade performance due to context dilution. Concurrently, state-of-the-art keyframe selection methods, while effective, still yield significant temporal redundancy, which we term "visual echoes". To address these dual challenges, we propose Adaptive Frame-Pruning (AFP), a novel post-processing method that intelligently prunes the selected keyframes. AFP employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify these echoes and merge each group into a single representative. To compensate for the resulting information loss, we then introduce a lightweight, text-based semantic graph that provides critical context with minimal token overhead. In extensive experiments on the LongVideoBench and VideoMME benchmarks across multiple leading MLLMs, our full approach reduces the required frames by up to 86.9% and the total input tokens by up to 83.2%. Crucially, by providing a concise, high-quality set of frames, our method not only enhances efficiency but often improves accuracy over baselines that use more frames. The code will be released upon publication.
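The clustering step of AFP can be illustrated with a short sketch. This is not the authors' released implementation: the function name `prune_frames`, the average-linkage choice, the distance threshold, and the centroid-based representative selection are all illustrative assumptions; the paper only specifies hierarchical clustering on fused ResNet-50 + CLIP features with one representative kept per cluster of "visual echoes".

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def prune_frames(features, distance_threshold=0.5):
    """Merge near-duplicate keyframes ("visual echoes") via hierarchical clustering.

    features: (n_frames, d) array of per-frame embeddings. In the paper these
    are fused ResNet-50 + CLIP features; any real-valued vectors work here.
    Returns indices of one representative frame per cluster, in temporal order.
    """
    # L2-normalise so Euclidean distance tracks cosine dissimilarity.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    if len(feats) == 1:
        return [0]
    # Agglomerative (average-linkage) clustering; the threshold controls how
    # similar two frames must be to count as echoes of one another.
    Z = linkage(feats, method="average", metric="euclidean")
    labels = fcluster(Z, t=distance_threshold, criterion="distance")
    reps = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        centroid = feats[members].mean(axis=0)
        # Keep the member closest to the cluster centroid as the representative.
        dists = np.linalg.norm(feats[members] - centroid, axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return sorted(reps)
```

With five frames drawn from two visual "scenes" (three echoes of one, two of the other), `prune_frames` would keep just two representatives, one per scene, which is the compression behaviour the abstract describes.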