Large Language Model as Token Compressor and Decompressor

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an efficient compression method based on self-expressive autoencoding to address the high computational and memory costs incurred by large language models (LLMs) when processing long texts. The approach leverages off-the-shelf pretrained LLMs to compress long documents into variable-length discrete latent tokens—termed Z-tokens—that enable lossless reconstruction, thereby revealing for the first time that existing LLMs can function as content-adaptive compressors and decompressors. By incorporating lightweight LoRA adapters, discrete latent variable modeling, and an autoregressive generation mechanism over the Z-token space, the method achieves up to 18× token compression on benchmarks such as Wikipedia while preserving both faithful text reconstruction and downstream task performance.

📝 Abstract
In this paper, we establish the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate this, we design a self-expressive autoencoding framework that fine-tunes a pretrained LLM, via lightweight LoRA-based adapter heads, to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation is content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed. Empirically, our method achieves up to 18× token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.
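The round-trip contract the abstract describes (lossless reconstruction from a variable-length discrete code, with fewer codes spent on predictable spans) can be illustrated with a toy stand-in. A minimal sketch, assuming nothing about the paper's actual model: run-length coding over word tokens replaces the LoRA-tuned LLM, and each `(word, count)` pair plays the role of one Z-token; the function names `compress` and `decompress` are illustrative, not the authors' API.

```python
# Toy analogue of content-adaptive compression into discrete latent codes.
# The paper's method trains a LoRA-adapted LLM end to end; here run-length
# coding stands in, so redundant spans collapse into few codes while
# varied spans keep more -- mirroring the variable-length Z-token idea.

def compress(words):
    # Emit (word, count) pairs; each pair acts as one discrete code.
    codes = []
    for w in words:
        if codes and codes[-1][0] == w:
            codes[-1] = (w, codes[-1][1] + 1)
        else:
            codes.append((w, 1))
    return codes

def decompress(codes):
    # Exact (lossless) reconstruction of the original token sequence.
    return [w for w, n in codes for _ in range(n)]

redundant = ["la"] * 16                      # highly predictable span
dense = ["alpha", "beta", "gamma", "delta"]  # varied, information-dense span

assert decompress(compress(redundant)) == redundant  # lossless round trip
assert len(compress(redundant)) == 1                 # aggressively compressed
assert len(compress(dense)) == 4                     # dense text keeps more codes
```

The point of the sketch is only the interface: a lossless codec whose code length tracks content predictability, which is the property the learned Z-token representation is claimed to have.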
Problem

Research questions and friction points this paper is trying to address.

token compression
long-context reasoning
large language models
text representation
efficient inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

token compression
large language model
Z-tokens
LoRA
autoencoding