🤖 AI Summary
This work investigates whether visual encoding can effectively enhance the context compression capability of language models. To this end, we propose a unified evaluation framework specifically designed to assess visual compression methods—including DeepSeek-OCR—and systematically compare parameter-free mean pooling, learnable hierarchical encoders, and state-of-the-art visual encoders across text reconstruction and downstream language modeling tasks. Our results demonstrate that, under identical compression ratios, simple aggregation methods achieve comparable or superior reconstruction fidelity and significantly outperform visual compression approaches in language modeling—sometimes even surpassing truncated baselines. This study provides the first systematic empirical evidence revealing the limitations of current visual compression techniques for language modeling, challenging the implicit assumption that “optical compression benefits language understanding.” Furthermore, we introduce a lightweight, efficient, and interpretable alternative paradigm grounded in minimalistic, non-visual aggregation strategies.
📝 Abstract
DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for reconstructing text from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing its vision encoder against simple alternatives, parameter-free mean pooling and a learned hierarchical encoder, we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios and outperform it for language modeling, where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
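To make the simplest baseline concrete: parameter-free mean pooling compresses a sequence of token embeddings by averaging non-overlapping windows, so a compression ratio of 4 turns T tokens into roughly T/4 aggregated vectors. The sketch below is illustrative only (the function name and numpy formulation are ours, not the authors' implementation; see the linked repository for the actual code):

```python
import numpy as np

def mean_pool_compress(embeddings: np.ndarray, ratio: int) -> np.ndarray:
    """Compress a (T, d) embedding sequence by a factor of `ratio`
    via mean pooling over non-overlapping windows. Parameter-free:
    no learned weights, unlike a vision encoder or hierarchical encoder."""
    T, d = embeddings.shape
    n_windows = -(-T // ratio)              # ceil(T / ratio)
    pad = n_windows * ratio - T
    padded = np.concatenate([embeddings, np.zeros((pad, d))]) if pad else embeddings
    sums = padded.reshape(n_windows, ratio, d).sum(axis=1)
    # divide each window's sum by the number of real (non-pad) tokens it covers
    counts = np.clip(T - np.arange(n_windows) * ratio, 0, ratio)
    return sums / counts[:, None]

tokens = np.random.randn(100, 16)           # 100 token embeddings, dim 16
compressed = mean_pool_compress(tokens, 4)  # shape (25, 16): 4x compression
```

A decoder-side language model then conditions on these pooled vectors instead of the full token sequence, giving the same token budget as a vision encoder at the matched compression ratio.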