VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

📅 2025-12-17
🤖 AI Summary
This work systematically investigates, for the first time, how vision-text compression (VTC) affects the long-context understanding capabilities of vision-language models (VLMs). To this end, we introduce VTCBench—the first long-context benchmark specifically designed for VTC evaluation—covering retrieval, reasoning, and memory tasks, along with its real-world extension, VTCBench-Wild. We propose a multi-dimensional evaluation framework integrating OCR encoding, 2D dense representation compression, cross-modal attention analysis, and dialogue memory tracking, enabling fine-grained, unified assessment of both open- and closed-source VLMs. Experimental results reveal that while state-of-the-art VLMs accurately decode OCR-extracted text, their long-range factual association and implicit reasoning abilities degrade substantially under VTC—causing an average performance drop of 42.7% across the three task categories. This exposes a critical semantic fidelity gap in current VTC methods that directly undermines VLMs' contextual comprehension.

📝 Abstract
The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering over long-term dialogue memory. Furthermore, we establish VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit surprisingly poor long-context understanding with VTC-compressed information, failing to capture long-range associations or dependencies in the context. This study provides a deeper understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
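The core VTC idea from the abstract—rendering long text as page images so the model consumes vision tokens instead of text tokens—can be illustrated with a back-of-the-envelope calculation. The constants below (characters per text token, characters per rendered page, vision tokens emitted per page) are illustrative assumptions, not values from the paper or from DeepSeek-OCR/Glyph:

```python
# Rough sketch of where VTC's token savings come from: a long text is
# rendered into dense page images, and the VLM's vision encoder emits a
# fixed token budget per page. All constants here are assumptions chosen
# only to make the arithmetic concrete.

def text_tokens(n_chars: int, chars_per_token: float = 4.0) -> int:
    """Approximate text-token count under a typical BPE tokenizer."""
    return round(n_chars / chars_per_token)

def vision_tokens(n_chars: int,
                  chars_per_page: int = 4000,   # assumed dense small-font rendering
                  tokens_per_page: int = 256) -> int:  # assumed encoder output per page
    """Vision-token count if the text is rendered page by page."""
    pages = -(-n_chars // chars_per_page)  # ceiling division
    return pages * tokens_per_page

n = 120_000  # characters, roughly a short book chapter
t, v = text_tokens(n), vision_tokens(n)
print(f"text tokens:   {t}")       # 30000
print(f"vision tokens: {v}")       # 7680
print(f"compression:   {t / v:.1f}x")  # 3.9x
```

Under these assumptions the ratio lands near the low end of the 3x-20x range cited in the abstract; denser rendering (more characters per page) or a smaller per-page token budget pushes it higher, which is exactly the information-density trade-off the benchmark probes.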
Problem

Research questions and friction points this paper is trying to address.

Evaluates VLMs' long-context understanding with vision-text compression
Assesses retrieval, reasoning, and memory in compressed visual representations
Reveals VLMs' poor performance in capturing long associations from VTC
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compress long text into dense 2D visual representations
Introduce benchmark to assess VLM long-context understanding
Evaluate models on retrieval, reasoning, and memory tasks
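The retrieval setting described above follows the familiar needle-in-a-haystack pattern: plant key facts at varying depths in filler context, query each one, and score recall. The sketch below is a hypothetical harness in that spirit, with a mock answer function standing in for the real pipeline (which would render the context to images and query a VLM); the filler text and needle format are made up for illustration:

```python
# Needle-in-a-haystack style retrieval check, in the spirit of the
# VTC-Retrieval task. The "model" here is a text-search stand-in; the
# actual benchmark would render the context via VTC and query a VLM.
import random

def build_context(needles: dict, n_filler: int = 500, seed: int = 0) -> str:
    """Insert needle sentences at random depths among filler sentences."""
    rng = random.Random(seed)
    lines = [f"Background sentence {i} with no useful content." for i in range(n_filler)]
    for key, value in needles.items():
        pos = rng.randrange(len(lines))
        lines.insert(pos, f"The secret code for {key} is {value}.")
    return "\n".join(lines)

def mock_vlm_answer(context: str, key: str) -> str:
    """Stand-in for a VLM call: scan the context for the planted fact."""
    marker = f"The secret code for {key} is "
    start = context.find(marker)
    if start == -1:
        return "unknown"
    return context[start + len(marker):].split(".", 1)[0]

needles = {"alpha": "7183", "beta": "2946", "gamma": "5520"}
ctx = build_context(needles)
correct = sum(mock_vlm_answer(ctx, k) == v for k, v in needles.items())
print(f"retrieval accuracy: {correct}/{len(needles)}")  # 3/3 for the mock
```

Swapping the mock for a real VLM call on the rendered pages, and varying needle depth and context length, yields the kind of retrieval-accuracy curves such benchmarks report.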
Hongbo Zhao
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Meng Wang
Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, CAS
Fei Zhu
Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, CAS
Wenzhuo Liu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Bolin Ni
Tencent Hunyuan Team
Fanhu Zeng
Institute of Automation, Chinese Academy of Sciences
Multimodal LLM · Trustworthy AI · Efficient Learning
Gaofeng Meng
Institute of Automation, Chinese Academy of Sciences; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, CAS
Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences
Computer Vision · Pattern Recognition · Biologically-inspired Learning