🤖 AI Summary
Vision Large Language Models (VLLMs) oriented toward text understanding suffer from visual token redundancy and low training/inference efficiency when processing high-resolution images; existing training-free compression methods exhibit poor semantic fidelity and substantial performance degradation on text-intensive tasks. This paper proposes the first two-stage visual token compression framework tailored for text-oriented VLLMs: Stage I employs lightweight self-distillation pretraining, while Stage II introduces task-aware post-training coupled with a dynamic token selection-and-reconstruction mechanism, requiring only a small number of image-text pairs and minimal learnable parameters. Evaluated on InternVL2, our method outperforms baselines across multiple text-centric benchmarks (e.g., DocVQA, OCR-VQA), reducing GPU memory consumption by 42%, FLOPs by 38%, and accelerating inference by 2.1×, all while preserving high semantic fidelity.
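The Stage I self-distillation idea can be pictured as aligning a token-compressed student's predictions with the uncompressed teacher's. Below is a minimal sketch of a temperature-scaled KL distillation loss; the loss form, temperature `tau`, and function names are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, tau=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    averaged over the batch and scaled by tau^2 (standard distillation)."""
    p = softmax(teacher_logits / tau)  # teacher: full visual tokens
    q = softmax(student_logits / tau)  # student: compressed visual tokens
    kl = (p * (np.log(p + 1e-9) - np.log(q + 1e-9))).sum(axis=-1)
    return float(kl.mean() * tau ** 2)
```

When student and teacher logits match, the loss is zero; training the lightweight compression module to minimize this loss is what lets the method get by with few image-text pairs and parameters.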
📝 Abstract
The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods exhibit serious performance degradation on tasks involving high-resolution, text-oriented image understanding and reasoning. In this paper, we propose an efficient visual token compression framework for text-oriented VLLMs in high-resolution scenarios. In particular, we employ a lightweight self-distillation pre-training stage to compress the visual tokens, requiring only a limited number of image-text pairs and minimal learnable parameters. Afterwards, to mitigate potential performance degradation of token-compressed models, we construct a high-quality post-training stage. To validate the effectiveness of our method, we apply it to an advanced VLLM, InternVL2. Experimental results show that our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks. We will release the models and code soon.
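The dynamic token selection-and-reconstruction mechanism can be sketched as: score each visual token's importance, keep the top-k, and fold the pruned tokens into a single score-weighted reconstruction token so their information is not discarded outright. The sketch below uses token norm as a stand-in importance score; the paper's actual scoring is learned and task-aware, and the `keep_ratio` value is an illustrative assumption:

```python
import numpy as np

def compress_tokens(tokens, keep_ratio=0.25):
    """Keep the top-scoring visual tokens; merge the rest into one
    reconstruction token via a score-weighted average.

    tokens: (n, d) array of visual token embeddings.
    Returns a (k + 1, d) array (k kept tokens + 1 reconstruction token).
    """
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    scores = np.linalg.norm(tokens, axis=1)      # proxy for learned importance
    order = np.argsort(scores)[::-1]             # highest score first
    kept = tokens[order[:k]]
    dropped = tokens[order[k:]]
    if dropped.size:
        w = scores[order[k:]]
        w = w / w.sum()                          # normalize dropped-token weights
        recon = (w[:, None] * dropped).sum(axis=0, keepdims=True)
        kept = np.concatenate([kept, recon], axis=0)
    return kept
```

For example, 64 input tokens with `keep_ratio=0.25` compress to 17 tokens (16 kept plus one reconstruction token), roughly matching the 4x-style token reduction that drives the reported memory and FLOPs savings.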