🤖 AI Summary
This work investigates whether visual encoding can effectively enhance the context compression capability of language models. To this end, we propose a unified evaluation framework specifically designed to assess visual compression methods—including DeepSeek-OCR—and systematically compare parameter-free mean pooling, learnable hierarchical encoders, and state-of-the-art visual encoders across text reconstruction and downstream language modeling tasks. Our results demonstrate that, under identical compression ratios, simple aggregation methods achieve comparable or superior reconstruction fidelity and significantly outperform visual compression approaches in language modeling—sometimes even surpassing truncated baselines. This study provides the first systematic empirical evidence revealing the limitations of current visual compression techniques for language modeling, challenging the implicit assumption that “optical compression benefits language understanding.” Furthermore, we introduce a lightweight, efficient, and interpretable alternative paradigm grounded in minimalistic, non-visual aggregation strategies.
📝 Abstract
DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for reconstructing text from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing its vision encoder against simple alternatives, parameter-free mean pooling and a learned hierarchical encoder, we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios and outperform it for language modeling, where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
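To make the simplest baseline concrete: parameter-free mean pooling compresses a sequence of token embeddings by averaging non-overlapping windows, so a compression ratio of 4 turns T tokens into roughly T/4 aggregated vectors. The sketch below is illustrative only (the function name and numpy formulation are ours, not the authors' implementation; see the linked repository for the actual code):

```python
import numpy as np

def mean_pool_compress(embeddings: np.ndarray, ratio: int) -> np.ndarray:
    """Compress a (T, d) embedding sequence by a factor of `ratio`
    via mean pooling over non-overlapping windows. Parameter-free:
    no learned weights, unlike a vision encoder or hierarchical encoder."""
    T, d = embeddings.shape
    n_windows = -(-T // ratio)              # ceil(T / ratio)
    pad = n_windows * ratio - T
    padded = np.concatenate([embeddings, np.zeros((pad, d))]) if pad else embeddings
    sums = padded.reshape(n_windows, ratio, d).sum(axis=1)
    # divide each window's sum by the number of real (non-pad) tokens it covers
    counts = np.clip(T - np.arange(n_windows) * ratio, 0, ratio)
    return sums / counts[:, None]

tokens = np.random.randn(100, 16)           # 100 token embeddings, dim 16
compressed = mean_pool_compress(tokens, 4)  # shape (25, 16): 4x compression
```

A decoder-side language model then conditions on these pooled vectors instead of the full token sequence, giving the same token budget as a vision encoder at the matched compression ratio.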