🤖 AI Summary
This study investigates the theoretical and practical limits of visual tokens in vision-language models (VLMs) for encoding textual information. Through a stress test that incrementally increases the number of characters embedded in an image, the authors uncover a three-phase transition in token recognition performance as information density rises: a stable phase, an instability phase, and a collapse phase. They propose, for the first time, a universal probabilistic scaling law that combines average token load with visual density, and demonstrate its validity across multiple mainstream VLMs. The work provides both theoretical grounding and empirical guidance for compressing high-density visual contexts, delineating the critical trade-off boundary between efficiency and accuracy in visual language understanding.
📝 Abstract
Recent vision-centric approaches have made significant strides in long-context modeling. Exemplified by DeepSeek-OCR, these models encode rendered text into continuous vision tokens, achieving high compression rates without sacrificing recognition precision. However, viewing the vision encoder as a lossy channel with finite representational capacity raises a fundamental question: what is the information upper bound of visual tokens? To investigate this limit, we conduct controlled stress tests that progressively increase the information quantity (character count) within an image. We observe a distinct phase-transition phenomenon characterized by three regimes: a near-perfect Stable Phase, an Instability Phase marked by increased error variance, and a total Collapse Phase. We analyze the mechanistic origins of these transitions and identify the key contributing factors. Furthermore, we formulate a probabilistic scaling law that unifies average vision token load and visual density into a latent difficulty metric. Extensive experiments across various Vision-Language Models demonstrate the universality of this scaling law, providing critical empirical guidance for optimizing the efficiency-accuracy trade-off in visual context compression.
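To make the controlled variable concrete, the sketch below shows how "average vision token load" can be computed when sweeping the character count at a fixed image resolution. The paper does not specify the exact encoder configuration; the 16×16 patch size, 1024×1024 resolution, and the sweep values here are illustrative assumptions, not values from the study.

```python
# Sketch of the stress-test setup: sweep the number of characters rendered
# into a fixed-resolution image and track the average vision token load
# (characters per vision token). Patch size, resolution, and sweep values
# are illustrative assumptions, not figures from the paper.

def num_vision_tokens(width: int, height: int, patch: int = 16) -> int:
    """Token count for an assumed patch-based encoder (16x16 patches)."""
    return (width // patch) * (height // patch)

def token_load(num_chars: int, width: int = 1024, height: int = 1024) -> float:
    """Average number of characters each vision token must encode."""
    return num_chars / num_vision_tokens(width, height)

# Progressively increase the information quantity, as in the stress test:
for n in (500, 2000, 8000):
    print(f"{n} chars -> {token_load(n):.2f} chars/token")
```

Under this framing, the phase transition corresponds to the token load crossing a critical threshold: recognition stays near-perfect while each token carries little information, then degrades with rising variance, then collapses.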