🤖 AI Summary
This work addresses the high inference cost of large vision-language models (LVLMs) on high-resolution images and long videos, where the number of visual tokens grows sharply. Existing token compression methods either rely on attention scores, which can introduce positional bias, or on representation-level redundancy measures that overlook the global structure of the token set. The study observes that visual token representations exhibit a pronounced low-rank structure whose dominant subspace stays stable even when a large fraction of tokens is randomly removed. Building on this observation, it proposes LRCP, a training-free method that estimates the dominant subspace with principal component analysis (PCA), scores each token by its projection residual, and keeps the tokens that the low-rank subspace explains poorly. LRCP retains only 11.1% of tokens in image understanding while preserving 94.7% of the original performance, and preserves 97.8% of the average video-understanding accuracy with just 12.5% of tokens retained.
📝 Abstract
Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
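The scoring step described above (estimate a dominant subspace with PCA, then rank tokens by projection residual) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the mean-centering step, the fixed `rank` argument, and the synthetic test data are all hypothetical choices, and the abstract does not say how LRCP selects the rank or where in the model the pruning is applied.

```python
import numpy as np


def lowrank_residual_scores(tokens, rank):
    """Squared distance of each token to the top-`rank` PCA subspace.

    tokens: (n_tokens, dim) array of visual token features.
    Centering and a fixed rank are assumptions of this sketch.
    """
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                        # (rank, dim), orthonormal rows
    projected = centered @ basis.T @ basis   # projection onto the subspace
    residual = centered - projected
    return np.sum(residual ** 2, axis=1)


def prune_tokens(tokens, keep_ratio, rank):
    """Keep the tokens with the largest residuals, in their original order."""
    n_keep = max(1, int(round(keep_ratio * tokens.shape[0])))
    scores = lowrank_residual_scores(tokens, rank)
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep_idx], keep_idx


# Synthetic demo (hypothetical data): a rank-4 "background" plus three
# tokens that do not lie in that background subspace.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 4)) @ rng.normal(size=(4, 64))
outliers = [10, 200, 400]
tokens[outliers] += 5.0 * rng.normal(size=(3, 64))

# Keep 64 of 576 tokens, i.e. about 11.1%, matching the image-setting ratio.
kept, keep_idx = prune_tokens(tokens, keep_ratio=64 / 576, rank=4)
```

In this toy setup the three off-subspace tokens receive much larger residuals than the background tokens, so they survive pruning. Sorting the kept indices preserves the original token order, which a downstream language model with positional encodings would likely need.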