Global Context Compression with Interleaved Vision-Text Transformation

📅 2026-01-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing vision-language approaches to context compression: they compress textual tokens only during the prefill phase and therefore fail to reduce the computational and memory overhead of autoregressive, token-by-token generation. To overcome this, the authors propose VIST2, a Transformer-based model that interleaves chunks of text with their corresponding visual encodings, enabling joint token compression across both the prefill and decoding stages. Notably, VIST2 predicts the distribution of the next textual token using only the visual tokens in the pre-context. The model is trained via a combination of text-to-sketch rendering, curriculum-scheduled pretraining for optical language modeling, and modality-interleaved instruction fine-tuning. At a 4× compression ratio, VIST2 achieves an average 3× speedup in first-token generation on long writing tasks while reducing memory usage by 77% and FLOPs by 74%.
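
As a sketch of this mechanism, the toy loop below shows one plausible reading of the decode-time behavior: earlier chunks survive only as visual tokens, while the chunk currently being generated stays textual until it is complete and gets compressed. Everything here is illustrative, not the authors' code; `render_chunk`, `ToyVisionEncoder`, and `ToyDecoder` are hypothetical stand-ins with made-up shapes, and only the control flow follows the description above.

```python
import torch
import torch.nn as nn

# Toy dimensions; the real models range from 0.6B to 8B parameters.
CHUNK, RATIO, D_MODEL, VOCAB = 8, 4, 64, 1000  # 8 text tokens -> 2 visual tokens


def render_chunk(token_ids: list[int]) -> torch.Tensor:
    """Stand-in for text-to-sketch rendering: a real system would rasterize
    the chunk's text into an image; here we fake a (1, 1, 16, 16*CHUNK) map."""
    g = torch.Generator().manual_seed(sum(token_ids))
    return torch.randn(1, 1, 16, 16 * CHUNK, generator=g)


class ToyVisionEncoder(nn.Module):
    """Compresses a rendered chunk image into CHUNK // RATIO visual tokens."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(16 * 16, D_MODEL)           # one 16x16 patch -> one feature
        self.pool = nn.AdaptiveAvgPool1d(CHUNK // RATIO)  # 4x fewer positions

    def forward(self, img):                                # img: (1, 1, 16, 16*CHUNK)
        patches = img.unfold(3, 16, 16)                    # (1, 1, 16, CHUNK, 16)
        patches = patches.permute(0, 3, 1, 2, 4).reshape(1, CHUNK, -1)
        feats = self.proj(patches)                         # (1, CHUNK, D_MODEL)
        return self.pool(feats.transpose(1, 2)).transpose(1, 2)


class ToyDecoder(nn.Module):
    """Stand-in decoder: mean-pools the mixed context to predict the next token."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, context):                            # context: (1, T, D_MODEL)
        return self.head(context.mean(dim=1))              # (1, VOCAB)


@torch.no_grad()
def generate(prompt_ids, n_steps=32):
    enc, dec = ToyVisionEncoder(), ToyDecoder()
    visual_ctx, text_buf = [], list(prompt_ids)            # compressed / pending
    for _ in range(n_steps):
        parts = list(visual_ctx)                           # pre-context: visual only
        if text_buf:                                       # current chunk stays textual
            parts.append(dec.embed(torch.tensor([text_buf])))
        next_id = int(dec(torch.cat(parts, dim=1)).argmax(dim=-1))
        text_buf.append(next_id)
        if len(text_buf) >= CHUNK:                         # chunk done: compress it
            visual_ctx.append(enc(render_chunk(text_buf[:CHUNK])))
            text_buf = text_buf[CHUNK:]
    kept = sum(v.shape[1] for v in visual_ctx) + len(text_buf)
    print(f"context footprint: {kept} tokens (uncompressed would be "
          f"{len(prompt_ids) + n_steps})")


generate([1, 2, 3], n_steps=29)
```

With these toy numbers, the run ends with a context footprint of 8 visual tokens in place of 32 text positions, the 4× ratio the paper targets.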

📝 Abstract
Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This has motivated earlier works that render the Transformer's input text into images for prefilling, which effectively reduces the number of tokens through visual encoding and thereby alleviates the quadratic growth of attention computation. However, this partial compression saves neither computation nor memory during token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both the prefilling and inference stages. We propose VIST2, a novel Transformer that interleaves input text chunks with their visual encodings while depending exclusively on the visual tokens in the pre-context to predict the next text token's distribution. To realize this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting with curriculum-scheduled pretraining for optical language modeling, followed by modality-interleaved instruction tuning. We conduct extensive experiments with the VIST2 family, scaled from 0.6B to 8B parameters, to explore the training recipe and hyperparameters. At a 4× compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving on average a 3× speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in FLOPs. Our code and datasets will be made public to support further studies.
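
As a sanity check on the headline numbers, the arithmetic below is illustrative only (the sequence length and cost model are assumptions, not the paper's measurement setup): KV-cache memory and the per-token matmul FLOPs that dominate a Transformer scale roughly linearly in the number of kept tokens, so a 4× reduction gives ~75% savings, close to the reported 77% and 74%; the attention-score term shrinks even faster, quadratically.

```python
# Illustrative arithmetic only; length and cost model are assumptions.
RATIO = 4
tokens_full = 8192                       # hypothetical long-context length
tokens_kept = tokens_full // RATIO       # 2048 after 4x compression

kv_cache_saving = 1 - tokens_kept / tokens_full           # linear in tokens
linear_flops_saving = 1 - tokens_kept / tokens_full       # FFN/projection matmuls
attn_score_saving = 1 - (tokens_kept / tokens_full) ** 2  # quadratic attention term

print(f"KV-cache memory saved:       {kv_cache_saving:.1%}")     # 75.0%
print(f"Linear-layer FLOPs saved:    {linear_flops_saving:.1%}")  # 75.0%
print(f"Attention-score FLOPs saved: {attn_score_saving:.1%}")    # 93.8%
```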
Problem

Research questions and friction points this paper is trying to address.

global context compression
vision-language models
token efficiency
inference optimization
Transformer compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

global context compression
interleaved vision-text transformation
visual token encoding
efficient inference
optical language modeling
Dian Jiao
China Electronics Cloud Technology Co., Ltd.
Jiaxin Duan
Peking University
Machine learning, deep learning, natural language processing
Shuai Zhao
China Electronics Cloud Technology Co., Ltd.
Jiabing Leng
China Electronics Cloud Technology Co., Ltd.
Yiran Zhang
China Electronics Cloud Technology Co., Ltd.
Feng Huang
Neusoft Medical System
MRI reconstruction, segmentation, registration