🤖 AI Summary
Multimodal large language models (MLLMs) suffer from suboptimal token efficiency and limited robustness because their tokenization strategies are ad hoc and lack principled foundations. Method: This work establishes, for the first time, a systematic mapping between MLLM token mechanisms—tokenization, token compression, and token-level reasoning—and classical visual coding principles, including transform coding, rate-distortion optimization, and hierarchical representation, thereby constructing a unified cross-domain analytical framework. It proposes a "bidirectional inspiration" mechanism: visual coding theory guides MLLM token design, while MLLM semantic modeling informs innovations in semantic visual encoders and decoders. Jointly optimizing information fidelity and computational cost enables modular, module-by-module comparative analysis. Contribution/Results: The framework provides an interpretable theoretical foundation and a novel paradigm for efficient multimodal model compression, lightweight deployment, and the design of next-generation semantic visual codecs.
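The joint fidelity–cost objective invoked above is, in classical visual coding, the rate–distortion Lagrangian. As a reminder of that standard formulation (the mapping of its terms to token budgets is an illustrative reading, not a formula from the paper):

```latex
% Classical rate-distortion Lagrangian: choose coding parameters \theta
% to trade off distortion D against rate R via multiplier \lambda.
\min_{\theta} \; J(\theta) \;=\; D(\theta) \;+\; \lambda \, R(\theta)
```

Under the cross-domain reading suggested by the summary, $R$ corresponds to the visual token count (computational cost) and $D$ to the loss of task-relevant semantic information, so token compression becomes a rate–distortion trade-off over token budgets.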
📝 Abstract
Classical visual coding and Multimodal Large Language Model (MLLM) token technology share the same core objective: maximizing information fidelity while minimizing computational cost. This paper therefore reexamines MLLM token technology, including tokenization, token compression, and token reasoning, through the established principles of the long-developed field of visual coding. From this perspective, we (1) establish a unified formulation bridging token technology and visual coding, enabling a systematic, module-by-module comparative analysis; (2) synthesize bidirectional insights, exploring how visual coding principles can enhance the efficiency and robustness of MLLM token techniques and, conversely, how token technology paradigms can inform the design of next-generation semantic visual codecs; and (3) highlight promising future research directions and critical unsolved challenges. In summary, this study presents the first comprehensive and structured comparison of MLLM token technology and visual coding, paving the way for more efficient multimodal models and more powerful visual codecs alike.