Glyph: Scaling Context Windows via Visual-Text Compression

📅 2025-10-20
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address the prohibitive computational and memory overhead incurred when scaling large language models (LLMs) to million-token contexts, this paper proposes Glyph: a framework that renders long texts as compact, semantically preserved images for processing by vision-language models (VLMs). Its core innovation lies in replacing conventional token-sequence expansion with visual-text representation, and in an LLM-guided genetic algorithm that automatically optimizes rendering configurations, balancing compression ratio against semantic fidelity. Experiments show that Glyph achieves 3–4× token compression while matching the accuracy of Qwen3-8B. It accelerates prefill and decoding by roughly 4×, speeds up supervised fine-tuning (SFT) by about 2×, and enables a 128K-context VLM to handle million-token document understanding, code analysis, and multi-step reasoning tasks efficiently.

📝 Abstract
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
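The 3-4x compression figure can be sanity-checked with back-of-envelope arithmetic. The sketch below is purely illustrative and uses hypothetical constants (roughly 4 characters per text token, 28-px square vision patches, 896x896 rendered pages); Glyph's actual tokenizer, renderer, and vision encoder differ.

```python
# Illustrative estimate of visual-text compression; all constants are
# assumptions, not Glyph's real rendering or tokenization parameters.

def text_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough LLM token count: ~4 characters per token for English text."""
    return max(1, round(len(text) / chars_per_token))

def vision_tokens(width: int, height: int, patch: int = 28) -> int:
    """Vision tokens for one rendered page, assuming square patches."""
    return (width // patch) * (height // patch)

def compression_ratio(text: str, pages: int,
                      width: int = 896, height: int = 896) -> float:
    """Text tokens replaced per vision token spent."""
    return text_tokens(text) / (pages * vision_tokens(width, height))

# Example: a ~120K-character document rendered onto 8 pages of 896x896 px.
doc = "x" * 120_000
ratio = compression_ratio(doc, pages=8)  # lands in the 3-4x range
```

Under these assumed numbers, 30K text tokens map to about 8K vision tokens, i.e. a ratio of roughly 3.7, consistent with the paper's reported range.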
Problem

Research questions and friction points this paper is trying to address.

Scaling context windows to the million-token level incurs prohibitive computational and memory costs
Compressing long texts into images while preserving semantic information
Achieving high token compression while maintaining accuracy in long-context tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Converts long text into compressed visual representations
Uses vision-language models for efficient semantic processing
Employs genetic search to optimize rendering configurations
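The genetic-search idea above can be sketched as a toy optimization loop. Everything here is hypothetical: the configuration space, the synthetic fitness function (a stand-in for Glyph's benchmark-driven scoring of accuracy versus compression), and plain random mutation in place of LLM-guided proposals.

```python
import random

# Toy genetic search over rendering configurations (dpi, font size, spacing).
# The fitness function is a synthetic proxy, not Glyph's real objective.
SPACE = {"dpi": [72, 96, 120, 150], "font_size": [6, 8, 10, 12],
         "line_spacing": [1.0, 1.2, 1.5]}

def random_config(rng):
    return {k: rng.choice(v) for k, v in SPACE.items()}

def fitness(cfg):
    # Hypothetical trade-off: smaller fonts/dpi compress more but hurt legibility.
    compression = (12.0 / cfg["font_size"]) * (96 / cfg["dpi"])
    legibility = (cfg["font_size"] / 12.0) * (cfg["dpi"] / 150) / cfg["line_spacing"]
    return 0.5 * compression + 0.5 * legibility  # balance the two objectives

def mutate(cfg, rng):
    child = dict(cfg)
    key = rng.choice(list(SPACE))
    child[key] = rng.choice(SPACE[key])
    return child

def genetic_search(generations=20, pop_size=8, seed=0):
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]  # keep the fittest half
        pop = survivors + [mutate(rng.choice(survivors), rng) for _ in survivors]
    return max(pop, key=fitness)

best = genetic_search()  # best configuration under the toy fitness
```

In the paper, the scoring step evaluates candidate renderings on held-out long-context tasks and an LLM guides the search; this sketch only shows the select-mutate loop shape.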
Authors

Jiale Cheng (The Conversational Artificial Intelligence (CoAI) Group, Tsinghua University)
Yusen Liu (Zhipu AI)
Xinyu Zhang (Zhipu AI)
Yulin Fei (Zhipu AI)
Wenyi Hong (Tsinghua University) · multimodal pretraining
Ruiliang Lyu (Zhipu AI)
Weihan Wang (Zhipu AI)
Zhe Su (Zhipu AI)
Xiaotao Gu (Zhipu AI) · Language Modeling · Generative Models · Data Mining
Xiao Liu (Zhipu AI)
Yushi Bai (Tsinghua University) · Large Language Models · Machine Learning · Knowledge Graph · Algorithmic Game Theory
Jie Tang (UW Madison) · Computed Tomography
Hongning Wang (Associate Professor, Department of Computer Science and Technology, Tsinghua University) · Machine Learning · Information Retrieval · Large Language Models
Minlie Huang (The Conversational Artificial Intelligence (CoAI) Group, Tsinghua University)