Can Vision-Language Models Handle Long-Context Code? An Empirical Study on Visual Compression

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of processing long code sequences with large language models, which are constrained by limited context windows; conventional text compression methods filter code selectively, disrupting code dependencies and causing semantic fragmentation. To overcome this, the authors propose LongCodeOCR, a framework that renders code into sequences of two-dimensional images and leverages vision-language models (e.g., Glyph) to perform globally aware compression, balancing broad contextual coverage with semantic fidelity. At comparable compression ratios, LongCodeOCR improves CompScore by 36.85 points on long-code summarization tasks, and at the 1M-token scale it reduces compression latency from roughly 4.3 hours to about one minute while achieving higher accuracy. The study also characterizes a fundamental trade-off between contextual breadth and symbolic precision.

📝 Abstract
Large Language Models (LLMs) struggle with long-context code due to window limitations. Existing textual code compression methods mitigate this via selective filtering but often disrupt dependency closure, causing semantic fragmentation. To address this, we introduce LongCodeOCR, a visual compression framework that renders code into compressed two-dimensional image sequences for Vision-Language Models (VLMs). By preserving a global view, this approach avoids the dependency breakage inherent in filtering. We systematically evaluate LongCodeOCR against the state-of-the-art LongCodeZip across four benchmarks spanning code summarization, code question answering, and code completion. Our results demonstrate that visual code compression serves as a viable alternative for tasks requiring global understanding. At comparable compression ratios (~1.7×), LongCodeOCR improves CompScore on Long Module Summarization by 36.85 points over LongCodeZip. At a 1M-token context length with Glyph (a specialized 9B VLM), LongCodeOCR maintains higher accuracy than LongCodeZip while operating at about 4× higher compression. Moreover, compared with LongCodeZip, LongCodeOCR drastically reduces compression-stage overhead (reducing latency from ~4.3 hours to ~1 minute at 1M tokens). Finally, our results characterize a fundamental coverage–fidelity trade-off: visual code compression retains broader context coverage to support global dependencies, yet faces fidelity bottlenecks on exactness-critical tasks; by contrast, textual code compression preserves symbol-level precision while sacrificing structural coverage.
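The abstract's core idea, trading text tokens for visual tokens by rendering code as images, can be illustrated with back-of-envelope arithmetic. All concrete numbers below (patch size, patch merging, characters per token, page geometry) are illustrative assumptions, not values from the paper:

```python
# Rough estimate of the compression ratio achieved by rendering code as an
# image and encoding it with a ViT-style vision tower. All constants here
# are illustrative assumptions, not figures from LongCodeOCR or Glyph.

def text_tokens(code: str, chars_per_token: float = 4.0) -> int:
    """Approximate text-token count for a code string (~4 chars/token)."""
    return max(1, round(len(code) / chars_per_token))

def image_tokens(width: int, height: int, patch: int = 14, merge: int = 2) -> int:
    """Visual tokens for one rendered page: 14x14-pixel patches,
    merged 2x2 before reaching the language model."""
    side = patch * merge
    return (width // side) * (height // side)

# Suppose one 896x896 page holds ~6,000 characters of rendered code.
t = text_tokens("x" * 6000)   # 1500 text tokens
v = image_tokens(896, 896)    # 32 * 32 = 1024 visual tokens
ratio = t / v                 # how many text tokens each visual token replaces
print(f"{t} text tokens -> {v} visual tokens ({ratio:.2f}x compression)")
```

Under these assumed numbers the ratio lands near the ~1.7× regime the abstract reports; denser rendering (smaller fonts, more characters per page) would raise the ratio further, at the cost of the fidelity bottlenecks the paper discusses.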
Problem

Research questions and friction points this paper is trying to address.

long-context code
code compression
dependency closure
semantic fragmentation
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual code compression
vision-language models
long-context code
dependency preservation
global context understanding
Authors
Jianping Zhong (Zhejiang University, Ningbo, China)
Guochang Li (Zhejiang University, Hangzhou, China)
Chen Zhi (Zhejiang University, Ningbo, China)
Junxiao Han (Hangzhou City University)
Zhen Qin (Zhejiang University; Service Computing, Federated Learning, Data Mining, Large Language Models)
Xinkui Zhao (Zhejiang University, Ningbo, China)
Nan Wang (Shenzhou Aerospace Software Technology Company Limited, Beijing, China)
Shuiguang Deng (Zhejiang University, Hangzhou, China)
Jianwei Yin (Professor of Computer Science and Technology, Zhejiang University; Service Computing, Computer Architecture, Distributed Computing, AI)