Wave-Particle (Continuous-Discrete) Dualistic Visual Tokenization for Unified Understanding and Generation

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the long-standing disconnect between continuous and discrete visual tokenization in multimodal large language models (MLLMs), which hinders the simultaneous achievement of high representational fidelity and computational efficiency. To bridge this gap, the authors propose the Continuous-Discrete Dualistic Visual Tokenizer (CDD-VT). Inspired by wave-particle duality in physics, CDD-VT introduces a dynamic primitive allocation mechanism that adaptively adjusts encoding density based on input image complexity. It further integrates quantized codebooks to generate diverse, semantically rich visual primitives, enabling a unified representation for both understanding and generation tasks. Extensive experiments show that CDD-VT consistently outperforms specialized continuous or discrete tokenizers across reconstruction, retrieval, and classification benchmarks. Notably, it achieves superior accuracy while significantly reducing engineering and deployment overhead (e.g., memory footprint and inference latency), establishing a new paradigm for efficient and expressive multimodal representation learning.

📝 Abstract
The unification of understanding and generation within a single multi-modal large language model (MLLM) remains a significant challenge, largely due to the dichotomy between continuous and discrete visual tokenizations. Continuous tokenizers (CT) achieve strong performance by bridging independently trained understanding and generation modules, but suffer from complex multi-stage pipelines and substantial engineering overhead. Conversely, discrete tokenizers (DT) offer a conceptually elegant idea by quantizing each image into primitives, but inevitably lead to information loss and performance degradation. To resolve this tension, we question the binary choice between CT and DT and, inspired by the wave-particle duality of light, propose the Continuous-Discrete Dualistic Visual Tokenizer (CDD-VT). We treat visual data as a flexible composition of image primitives drawn from quantized codebooks, with the crucial insight that the number of primitives assigned to each visual sample is adaptively determined by its complexity: simple instances use few primitives, emulating discrete tokenization, while complex instances use many, approximating continuous tokenization. Two core components are designed: Diverse Quantitative Primitives, which encourages primitive orthogonality to better populate the information space, and the Dynamic Primitive Allocator, which assesses sample complexity to determine the optimal set of primitives. Extensive experiments on reconstruction, retrieval, and classification show that CDD-VT achieves superior performance over specialized CT and DT, delivering strong results within a concise and scalable MLLM.
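The core idea of adaptive primitive allocation can be illustrated with a minimal sketch: greedily select primitives from an orthogonal codebook until the reconstruction residual is small enough, so simple inputs stop after a few picks while complex inputs use many. This is an illustrative toy (the function names, complexity criterion, and greedy selection are assumptions for exposition), not the paper's actual implementation.

```python
import numpy as np

def allocate_primitives(feature, codebook, k_min=1, k_max=8, threshold=0.9):
    """Toy dynamic primitive allocator (illustrative, not the paper's code):
    greedily add codebook primitives until the residual norm falls below
    (1 - threshold) of the input norm, capped at k_max primitives."""
    residual = feature.astype(float).copy()
    chosen = []
    target = (1.0 - threshold) * np.linalg.norm(feature)
    for _ in range(k_max):
        # pick the primitive most aligned with the current residual
        scores = codebook @ residual
        idx = int(np.argmax(np.abs(scores)))
        coef = scores[idx] / np.dot(codebook[idx], codebook[idx])
        chosen.append((idx, coef))
        residual -= coef * codebook[idx]
        if len(chosen) >= k_min and np.linalg.norm(residual) <= target:
            break
    return chosen, residual

# A "simple" sample built from two codebook rows needs only two primitives,
# mimicking discrete tokenization; a noisy sample would consume more.
rng = np.random.default_rng(0)
codebook, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # orthonormal primitives
simple = 2.0 * codebook[3] + 0.5 * codebook[7]
chosen, residual = allocate_primitives(simple, codebook)
print(len(chosen), np.linalg.norm(residual))
```

The orthonormal codebook here stands in for the paper's Diverse Quantitative Primitives: when primitives are near-orthogonal, each greedy pick removes an independent component of the signal, so the allocator's stopping point directly reflects sample complexity.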
Problem

Research questions and friction points this paper is trying to address.

Unifying understanding and generation in multimodal models
Resolving dichotomy between continuous and discrete visual tokenizations
Adaptive token allocation based on visual complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive visual tokenization using continuous-discrete duality
Dynamic primitive allocation based on image complexity
Diverse orthogonal primitives for enhanced information representation