FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression

📅 2025-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale vision-language models (VLLMs) oriented toward text understanding suffer from visual token redundancy and low training/inference efficiency when processing high-resolution images; existing training-free compression methods exhibit poor semantic fidelity and substantial performance degradation on text-intensive tasks. This paper proposes the first two-stage visual token compression framework tailored for text-oriented VLLMs: Stage I employs lightweight self-distillation pretraining, while Stage II introduces task-aware post-training coupled with a dynamic token selection-and-reconstruction mechanism—requiring only a small number of image-text pairs and minimal learnable parameters. Evaluated on InternVL2, our method outperforms baselines across multiple text-centric benchmarks (e.g., DocVQA, OCR-VQA), reducing GPU memory consumption by 42%, FLOPs by 38%, and accelerating inference by 2.1×, all while preserving high semantic fidelity.
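The summary above mentions a dynamic token selection-and-reconstruction mechanism. The paper's code is not yet released, so the following is only a minimal, illustrative sketch of the general idea: rank visual tokens by an importance score, keep the top fraction, and merge the dropped tokens into a single summary token so their information is not entirely lost. The scoring, keep ratio, and mean-merge rule here are assumptions for illustration, not the authors' actual design.

```python
# Illustrative score-based token selection with reconstruction of the dropped
# tokens into one mean "summary" token. NOT the FCoT-VL implementation: the
# keep ratio and merge rule are assumptions made for this sketch.

def compress_tokens(tokens, scores, keep_ratio=0.25):
    """tokens: list of feature vectors (lists of floats); scores: one importance value per token."""
    k = max(1, int(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:k])                  # preserve spatial order of kept tokens
    drop = sorted(order[k:])
    kept = [tokens[i] for i in keep]
    if drop:                                  # "reconstruction": merge dropped tokens into one summary token
        dim = len(tokens[0])
        summary = [sum(tokens[i][d] for i in drop) / len(drop) for d in range(dim)]
        kept.append(summary)
    return kept

# 8 two-dimensional tokens, keep 25% plus one summary token -> 3 output tokens
toks = [[float(i), float(-i)] for i in range(8)]
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
out = compress_tokens(toks, scores)
print(len(out))  # 3
```

In a real VLLM the scores would come from attention statistics or a learned scorer, and the merge would typically be a trained module rather than a plain mean; this sketch only conveys why such a scheme can cut the token count while retaining a coarse record of what was dropped.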

📝 Abstract
The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution, text-oriented image understanding and reasoning. In this paper, we propose an efficient visual token compression framework for text-oriented VLLMs in high-resolution scenarios. In particular, we employ a light-weight self-distillation pre-training stage to compress the visual tokens, requiring a limited number of image-text pairs and minimal learnable parameters. Afterwards, to mitigate potential performance degradation of token-compressed models, we construct a high-quality post-training stage. To validate the effectiveness of our method, we apply it to an advanced VLLM, InternVL2. Experimental results show that our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks. We will release the models and code soon.
Problem

Research questions and friction points this paper is trying to address.

Efficient visual token compression
Text-oriented image understanding
Minimizing computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient visual token compression
Light-weight self-distillation pre-training
High-quality post-training stage
👥 Authors
Jianjian Li
University of Science and Technology of China
Junquan Fan
University of Science and Technology of China
Feng Tang
Apple Inc.
Computer vision, machine learning and multimedia
Gang Huang
Huawei Noah's Ark Lab
Shitao Zhu
Huawei Noah's Ark Lab
Songlin Liu
Huawei Noah's Ark Lab
Nian Xie
Huawei Noah's Ark Lab
Wulong Liu
Unknown affiliation
Reinforcement Learning, Autonomous Driving, Robotics, AI Infra, EDA
Yong Liao
University of Science and Technology of China
Network security, data mining, Internet routing, network virtualization