🤖 AI Summary
Existing video multimodal large language models (V-MLLMs) rely on attention-score-based token compression, which often leads to incomplete semantic coverage and redundancy. To address this, we propose a training-free spatio-temporal collaborative token compression framework. Our approach introduces the concept of Semantic Connected Components (SCCs): by performing semantic connectivity analysis, it identifies non-overlapping, semantically complete token subsets and jointly prunes tokens across both spatial and temporal dimensions, enabling efficient yet faithful compression of video representations. Crucially, it eliminates dependence on attention scores, significantly improving semantic completeness—especially at low token retention ratios. Experiments demonstrate that our method consistently outperforms state-of-the-art approaches across diverse video understanding benchmarks, including video question answering, long-video comprehension, and multiple-choice tasks. Notably, it achieves substantial performance gains under extreme compression (e.g., 10% token retention), highlighting its robustness and fidelity.
📝 Abstract
In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but they fail to capture all semantic regions and often lead to token redundancy. In contrast, we propose the Semantic Connected Components (SCC) approach, which assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. Building on SCC, we develop a two-step spatio-temporal token compression strategy that applies SCC in both the spatial and temporal domains, representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that LLaVA-Scissor outperforms other token compression methods across these benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
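The abstract does not spell out how semantic connected components are formed, but a natural reading is: build a similarity graph over token embeddings, threshold the edges, and take each connected component as one semantic region, compressed to a single representative token. Below is a minimal sketch under that assumption; the threshold `tau`, the cosine-similarity graph, and mean-pooling of each component are illustrative choices, not the paper's confirmed implementation.

```python
import numpy as np

def semantic_connected_components(tokens, tau=0.8):
    """Group token embeddings into connected components of a
    thresholded cosine-similarity graph and return one mean
    (representative) token per component.

    tokens: (N, D) array of token embeddings.
    tau: similarity threshold (hypothetical hyperparameter).
    """
    # Cosine similarity between all token pairs.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    adj = (normed @ normed.T) >= tau  # edges of the semantic graph

    n = len(tokens)
    labels = np.full(n, -1)  # component id per token
    comp = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        # Iterative DFS flood-fill over the thresholded graph.
        stack = [seed]
        labels[seed] = comp
        while stack:
            u = stack.pop()
            for v in np.nonzero(adj[u])[0]:
                if labels[v] == -1:
                    labels[v] = comp
                    stack.append(v)
        comp += 1
    # Represent each semantic region by its mean token.
    reps = np.stack([tokens[labels == c].mean(axis=0) for c in range(comp)])
    return reps, labels
```

A two-step spatio-temporal variant in the spirit of the abstract would run this per frame (spatial step) and then once more over the concatenated per-frame representatives (temporal step), so near-duplicate regions across frames collapse into a single token.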