LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the sharp accuracy loss that vision-language models suffer when text is rendered into images at high compression ratios, where characters shrink below the vision encoder's effective resolution and become indistinguishable. The authors propose LensVLM, an inference framework and post-training recipe in which the model first scans the compressed images and then uses learned tools to selectively expand only the relevant ones to their uncompressed form, preserving accuracy while retaining most of the compression benefit. Built on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3× effective compression and outperforms retrieval-based, text-compression, and visual-compression baselines at up to 10.1× effective compression across seven text QA benchmarks. It also generalizes to multimodal document and code understanding tasks, where its accuracy gain over baselines grows as compression increases.

📝 Abstract

Vision Language Models (VLMs) offer the exciting possibility of processing text as rendered images, bypassing the need for tokenizing the text into long token sequences. Since VLM image encoders map fixed-size images to a fixed number of visual tokens, varying rendering resolution provides a fine-grained compression knob. However, accuracy deteriorates quickly as compression increases: characters shrink below the vision encoder's effective resolution, making them indistinguishable. To address this, we propose LensVLM, an inference framework and post-training recipe that enables VLMs to scan compressed images, then selectively expand only the relevant images to their uncompressed form via learned tools. Building on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text- and visual-compression baselines up to 10.1x effective compression across seven text QA benchmarks. LensVLM also generalizes to multimodal document and code understanding tasks, with the accuracy gain over baselines growing as compression increases. Our analysis validates this approach: training makes visual compression robust to rendering choices, and as compression grows the model increasingly relies on expanded content rather than unreliable visual reading. The analysis also yields practical tool-choice guidance: text expansion is preferable for rendered text, while high-resolution image expansion suits native documents whose layout cues carry task-relevant information.
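To make the "compression knob" and the cost of selective expansion concrete, here is a minimal Python sketch of how an effective compression ratio could be accounted for. It is an illustration only, not the authors' definition: it assumes that effective compression is the ratio of full-text tokens to tokens actually consumed, and that an expanded page is paid for at its full text-token cost on top of its compressed visual tokens. The `Page` type, the `effective_compression` helper, and all token counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Page:
    """One chunk of text rendered as a single image (hypothetical type)."""
    text_tokens: int    # tokens needed if the chunk were fed as plain text
    visual_tokens: int  # visual tokens the encoder emits for the rendered image


def effective_compression(pages: List[Page], expanded: Iterable[int]) -> float:
    """Full-text tokens divided by tokens actually consumed.

    Assumption (not from the paper): every page is first scanned in its
    compressed visual form, and each expanded page additionally costs its
    full text-token count.
    """
    full_cost = sum(p.text_tokens for p in pages)
    scan_cost = sum(p.visual_tokens for p in pages)
    expand_cost = sum(pages[i].text_tokens for i in set(expanded))
    return full_cost / (scan_cost + expand_cost)


# Hypothetical numbers: 10 pages, 1000 text tokens each, 64 visual tokens each.
pages = [Page(text_tokens=1000, visual_tokens=64) for _ in range(10)]
no_expansion = effective_compression(pages, expanded=[])
one_expanded = effective_compression(pages, expanded=[3])
print(f"scan only: {no_expansion:.2f}x, one page expanded: {one_expanded:.2f}x")
```

Under these assumptions, packing more text into each image raises the ratio, while every expansion lowers it, so a model that expands only the relevant pages retains most of the compression benefit.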

Problem

Research questions and friction points this paper is trying to address.

Vision Language Models

text compression

visual representation

rendered text

resolution degradation

Innovation

Methods, ideas, or system contributions that make the work stand out.

selective context expansion

compressed visual representation

vision language models