Glyph: Scaling Context Windows via Visual-Text Compression

📅 2025-10-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Scaling large language models (LLMs) to million-token contexts incurs prohibitive computational and memory overhead. This paper proposes Glyph, a framework that renders long texts as compact images that preserve their semantic content and processes them with vision-language models (VLMs). Rather than extending the token sequence, Glyph represents text visually and uses an LLM-driven genetic search to optimize the rendering configuration automatically, balancing compression ratio against accuracy. Experiments show 3–4× token compression with accuracy comparable to leading LLMs such as Qwen3-8B on long-context benchmarks. The compression yields roughly 4× faster prefill and decoding and approximately 2× faster supervised fine-tuning (SFT). Under extreme compression, a 128K-context VLM could scale to 1M-token-level text tasks.

📝 Abstract

Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
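
To make the compression claim concrete, the following Python sketch estimates how many visual tokens a rendered document might need and compares that with its text-token count. Everything in it is an assumption for illustration, not Glyph's actual renderer or settings: the `RenderConfig` fields, the 28-pixel patch per visual token, the average of 4 characters per text token, and the fixed-width page geometry are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderConfig:
    """Hypothetical rendering parameters (not taken from the paper)."""
    page_width_px: int = 896
    page_height_px: int = 1120
    font_px: int = 14           # glyph height in pixels
    line_spacing: float = 1.2   # line height as a multiple of font_px
    char_aspect: float = 0.5    # assumed average glyph width / font_px
    patch_px: int = 28          # assumed pixels per side of one visual token


def chars_per_page(cfg: RenderConfig) -> int:
    """Approximate number of characters that fit on one rendered page."""
    cols = int(cfg.page_width_px // (cfg.font_px * cfg.char_aspect))
    rows = int(cfg.page_height_px // (cfg.font_px * cfg.line_spacing))
    return cols * rows


def visual_tokens_per_page(cfg: RenderConfig) -> int:
    """Number of patch-sized visual tokens needed to cover one page."""
    return (math.ceil(cfg.page_width_px / cfg.patch_px)
            * math.ceil(cfg.page_height_px / cfg.patch_px))


def compression_ratio(num_chars: int, cfg: RenderConfig,
                      chars_per_text_token: float = 4.0) -> float:
    """Estimated text tokens divided by estimated visual tokens."""
    text_tokens = num_chars / chars_per_text_token
    pages = math.ceil(num_chars / chars_per_page(cfg))
    visual_tokens = pages * visual_tokens_per_page(cfg)
    return text_tokens / visual_tokens


if __name__ == "__main__":
    for font_px in (14, 10):
        cfg = RenderConfig(font_px=font_px)
        print(font_px, round(compression_ratio(1_000_000, cfg), 2))
```

Under these assumed numbers, shrinking the font packs more characters into each page, so fewer pages and fewer visual tokens are needed. The cost is legibility for the VLM, which is why the rendering configuration has to be tuned rather than simply minimized.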

Problem

Research questions and friction points this paper is trying to address.

Scaling context windows to the million-token level incurs prohibitive computational and memory costs

Compressing long texts into images while preserving semantic information

Achieving high token compression while maintaining accuracy in long-context tasks
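
The link between compression ratio and reachable context length is simple arithmetic, sketched below in Python. The function names are hypothetical, and the calculation assumes the whole visual context window is spent on rendered text, which ignores prompt and output tokens.

```python
def effective_text_context(visual_context_tokens: int, ratio: float) -> int:
    """Text tokens representable when each visual token stands for `ratio` text tokens."""
    return int(visual_context_tokens * ratio)


def required_ratio(target_text_tokens: int, visual_context_tokens: int) -> float:
    """Compression ratio needed to fit `target_text_tokens` into the visual window."""
    return target_text_tokens / visual_context_tokens


if __name__ == "__main__":
    window = 128_000  # a 128K-context VLM
    for ratio in (3.0, 4.0):
        print(ratio, effective_text_context(window, ratio))
    print(required_ratio(1_000_000, window))
```

With the reported 3-4x compression, a 128K window corresponds to roughly 384K to 512K text tokens under this simplification. Reaching 1M tokens would need a ratio near 8, which is consistent with the abstract attributing that regime to extreme compression.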

Innovation

Methods, ideas, or system contributions that make the work stand out.

Converts long text into compressed visual representations

Uses vision-language models for efficient semantic processing

Employs an LLM-driven genetic search to find rendering configurations that balance accuracy and compression
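
The search over rendering configurations can be pictured with a plain genetic algorithm, sketched below in Python. This is not the paper's procedure: the source describes the search as LLM-driven, whereas this sketch uses random mutation and crossover, and `toy_fitness` is a made-up stand-in for evaluating a VLM on rendered data. The search space, parameter names, and values are likewise hypothetical.

```python
import random

# Hypothetical search space; the real rendering parameters may differ.
SEARCH_SPACE = {
    "font_px": [8, 10, 12, 14, 16],
    "dpi": [72, 96, 120],
    "line_spacing": [1.0, 1.2, 1.5],
}


def random_config(rng):
    """Sample one configuration uniformly from the search space."""
    return {key: rng.choice(values) for key, values in SEARCH_SPACE.items()}


def mutate(cfg, rng, rate=0.3):
    """Resample each parameter independently with probability `rate`."""
    return {key: (rng.choice(SEARCH_SPACE[key]) if rng.random() < rate else value)
            for key, value in cfg.items()}


def crossover(a, b, rng):
    """Take each parameter from one of the two parents at random."""
    return {key: (a[key] if rng.random() < 0.5 else b[key]) for key in a}


def toy_fitness(cfg):
    """Placeholder score: a compression proxy times a legibility proxy."""
    compression = 16 / cfg["font_px"] / cfg["line_spacing"]
    legibility = min(1.0, cfg["font_px"] * cfg["dpi"] / (12 * 96))
    return compression * legibility


def genetic_search(fitness, generations=10, pop_size=12, elite=4, seed=0):
    """Keep the top `elite` configs each generation and breed the rest from them."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:elite]
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = genetic_search(toy_fitness)
    print(best, round(toy_fitness(best), 3))
```

Because the elite configurations are carried over unchanged, the best score never decreases between generations. In a real setting the fitness call would be the expensive step, since it would involve rendering data and measuring model accuracy and compression.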
