FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression

📅 2025-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale vision-language models (VLLMs) oriented toward text understanding suffer from visual token redundancy and low training/inference efficiency when processing high-resolution images; existing training-free compression methods exhibit poor semantic fidelity and substantial performance degradation on text-intensive tasks. This paper proposes the first two-stage visual token compression framework tailored for text-oriented VLLMs: Stage I employs lightweight self-distillation pretraining, while Stage II introduces task-aware post-training coupled with a dynamic token selection-and-reconstruction mechanism—requiring only a small number of image-text pairs and minimal learnable parameters. Evaluated on InternVL2, our method outperforms baselines across multiple text-centric benchmarks (e.g., DocVQA, OCR-VQA), reducing GPU memory consumption by 42%, FLOPs by 38%, and accelerating inference by 2.1×, all while preserving high semantic fidelity.
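The summary above mentions a dynamic token selection-and-reconstruction mechanism. The paper's code is not yet released, so the following is only a minimal, illustrative sketch of the general idea: rank visual tokens by an importance score, keep the top fraction, and merge the dropped tokens into a single summary token so their information is not entirely lost. The scoring, keep ratio, and mean-merge rule here are assumptions for illustration, not the authors' actual design.

```python
# Illustrative score-based token selection with reconstruction of the dropped
# tokens into one mean "summary" token. NOT the FCoT-VL implementation: the
# keep ratio and merge rule are assumptions made for this sketch.

def compress_tokens(tokens, scores, keep_ratio=0.25):
    """tokens: list of feature vectors (lists of floats); scores: one importance value per token."""
    k = max(1, int(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:k])                  # preserve spatial order of kept tokens
    drop = sorted(order[k:])
    kept = [tokens[i] for i in keep]
    if drop:                                  # "reconstruction": merge dropped tokens into one summary token
        dim = len(tokens[0])
        summary = [sum(tokens[i][d] for i in drop) / len(drop) for d in range(dim)]
        kept.append(summary)
    return kept

# 8 two-dimensional tokens, keep 25% plus one summary token -> 3 output tokens
toks = [[float(i), float(-i)] for i in range(8)]
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
out = compress_tokens(toks, scores)
print(len(out))  # 3
```

In a real VLLM the scores would come from attention statistics or a learned scorer, and the merge would typically be a trained module rather than a plain mean; this sketch only conveys why such a scheme can cut the token count while retaining a coarse record of what was dropped.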

📝 Abstract
The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution, text-oriented image understanding and reasoning. In this paper, we propose an efficient visual token compression framework for text-oriented VLLMs in high-resolution scenarios. In particular, we employ a light-weight self-distillation pre-training stage to compress the visual tokens, requiring a limited number of image-text pairs and minimal learnable parameters. Afterwards, to mitigate potential performance degradation of token-compressed models, we construct a high-quality post-training stage. To validate the effectiveness of our method, we apply it to an advanced VLLM, InternVL2. Experimental results show that our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks. We will release the models and code soon.
Problem

Research questions and friction points this paper is trying to address.

Efficient visual token compression
Text-oriented image understanding
Minimizing computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient visual token compression
Light-weight self-distillation pre-training
High-quality post-training stage
👥 Authors
Jianjian Li
University of Science and Technology of China
Junquan Fan
University of Science and Technology of China
Feng Tang
Apple Inc.
Computer vision, machine learning and multimedia
Gang Huang
Huawei Noah's Ark Lab
Shitao Zhu
Huawei Noah's Ark Lab
Songlin Liu
Huawei Noah's Ark Lab
Nian Xie
Huawei Noah's Ark Lab
Wulong Liu
Unknown affiliation
Reinforcement Learning, Autonomous Driving, Robotics, AI Infra, EDA
Yong Liao
University of Science and Technology of China
Network security, data mining, Internet routing, network virtualization