Can Vision-Language Models Handle Long-Context Code? An Empirical Study on Visual Compression

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of processing long code sequences with large language models, which are constrained by limited context windows; conventional text compression methods filter code selectively, disrupting code dependencies and causing semantic fragmentation. To overcome this, the authors propose LongCodeOCR, a framework that renders code into sequences of two-dimensional images and leverages vision-language models (e.g., Glyph) to perform globally aware compression, balancing broad contextual coverage with semantic fidelity. At comparable compression ratios, LongCodeOCR improves CompScore by 36.85 points on long-code summarization tasks, and at the 1M-token scale it reduces compression latency from roughly 4.3 hours to about one minute while achieving higher accuracy. The study also characterizes a fundamental trade-off between contextual breadth and symbolic precision.

📝 Abstract
Large Language Models (LLMs) struggle with long-context code due to window limitations. Existing textual code compression methods mitigate this via selective filtering but often disrupt dependency closure, causing semantic fragmentation. To address this, we introduce LongCodeOCR, a visual compression framework that renders code into compressed two-dimensional image sequences for Vision-Language Models (VLMs). By preserving a global view, this approach avoids the dependency breakage inherent in filtering. We systematically evaluate LongCodeOCR against the state-of-the-art LongCodeZip across four benchmarks spanning code summarization, code question answering, and code completion. Our results demonstrate that visual code compression serves as a viable alternative for tasks requiring global understanding. At comparable compression ratios (~1.7×), LongCodeOCR improves CompScore on Long Module Summarization by 36.85 points over LongCodeZip. At a 1M-token context length with Glyph (a specialized 9B VLM), LongCodeOCR maintains higher accuracy than LongCodeZip while operating at about 4× higher compression. Moreover, compared with LongCodeZip, LongCodeOCR drastically reduces compression-stage overhead (reducing latency from ~4.3 hours to ~1 minute at 1M tokens). Finally, our results characterize a fundamental coverage–fidelity trade-off: visual code compression retains broader context coverage to support global dependencies, yet faces fidelity bottlenecks on exactness-critical tasks; by contrast, textual code compression preserves symbol-level precision while sacrificing structural coverage.
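The abstract's core idea, trading text tokens for visual tokens by rendering code as images, can be illustrated with back-of-envelope arithmetic. All concrete numbers below (patch size, patch merging, characters per token, page geometry) are illustrative assumptions, not values from the paper:

```python
# Rough estimate of the compression ratio achieved by rendering code as an
# image and encoding it with a ViT-style vision tower. All constants here
# are illustrative assumptions, not figures from LongCodeOCR or Glyph.

def text_tokens(code: str, chars_per_token: float = 4.0) -> int:
    """Approximate text-token count for a code string (~4 chars/token)."""
    return max(1, round(len(code) / chars_per_token))

def image_tokens(width: int, height: int, patch: int = 14, merge: int = 2) -> int:
    """Visual tokens for one rendered page: 14x14-pixel patches,
    merged 2x2 before reaching the language model."""
    side = patch * merge
    return (width // side) * (height // side)

# Suppose one 896x896 page holds ~6,000 characters of rendered code.
t = text_tokens("x" * 6000)   # 1500 text tokens
v = image_tokens(896, 896)    # 32 * 32 = 1024 visual tokens
ratio = t / v                 # how many text tokens each visual token replaces
print(f"{t} text tokens -> {v} visual tokens ({ratio:.2f}x compression)")
```

Under these assumed numbers the ratio lands near the ~1.7× regime the abstract reports; denser rendering (smaller fonts, more characters per page) would raise the ratio further, at the cost of the fidelity bottlenecks the paper discusses.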
Problem

Research questions and friction points this paper is trying to address.

long-context code
code compression
dependency closure
semantic fragmentation
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual code compression
vision-language models
long-context code
dependency preservation
global context understanding
Authors
Jianping Zhong (Zhejiang University, Ningbo, China)
Guochang Li (Zhejiang University, Hangzhou, China)
Chen Zhi (Zhejiang University, Ningbo, China)
Junxiao Han (Hangzhou City University)
Zhen Qin (Zhejiang University; Service Computing, Federated Learning, Data Mining, Large Language Models)
Xinkui Zhao (Zhejiang University, Ningbo, China)
Nan Wang (Shenzhou Aerospace Software Technology Company Limited, Beijing, China)
Shuiguang Deng (Zhejiang University, Hangzhou, China)
Jianwei Yin (Professor of Computer Science and Technology, Zhejiang University; Service Computing, Computer Architecture, Distributed Computing, AI)