🤖 AI Summary
Existing multimodal large language models (MLLMs) encode images as spatial visual tokens (e.g., raster-scanned patches), which lack the recursive, hierarchical structure inherent to language and thus hinder large language models' ability to model visual semantics effectively.
Method: We propose a discrete recursive vision-language representation grounded in diffusion timesteps: treating each timestep as a semantic unit, we construct temporally ordered, hierarchically structured visual tokens naturally suited for autoregressive modeling. Our approach integrates discrete diffusion tokenization, timestep-aware visual encoding, multimodal joint pretraining, and co-optimization of autoregressive language modeling with progressive image reconstruction.
Contribution/Results: This work achieves the first native unification of LLM-style autoregressive inference and diffusion-based high-fidelity generation. It outperforms state-of-the-art MLLMs on both visual understanding and generation tasks, delivering bidirectional cross-modal capability in a single framework with no mode switching.
📝 Abstract
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLMs and diffusion models, the state of the art in each task respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged in a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages, and hence form an impossible language for an LLM to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and of diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve superior performance on multimodal comprehension and generation simultaneously compared with other MLLMs. Project Page: https://DDT-LLaMA.github.io/.
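The recursive-compensation idea can be illustrated with a toy sketch. This is NOT the paper's implementation (which learns a codebook and operates in diffusion latent space); it is a minimal numerical analogy in which each "timestep" emits a discrete token encoding the residual correction toward the clean signal, so the token sequence is temporally ordered and coarse-to-fine, and a usable reconstruction exists at every step. The codebook size `K`, number of steps `T`, and the residual-splitting scheme are all illustrative assumptions.

```python
import numpy as np

# Toy analogy of timestep-ordered recursive visual tokens (illustrative only,
# not the paper's actual tokenizer or training procedure).

rng = np.random.default_rng(0)
T = 4                      # number of "diffusion timesteps" (assumed)
x0 = rng.normal(size=8)    # stand-in for a clean image

# Trivial "codebook": K uniform levels on [-3, 3] (assumed for illustration).
K = 256
levels = np.linspace(-3, 3, K)

def quantize(v):
    """Map continuous values to discrete token ids (nearest codebook level)."""
    return np.abs(v[:, None] - levels[None, :]).argmin(axis=1)

def dequantize(ids):
    """Look up codebook values for token ids."""
    return levels[ids]

# Build a temporally ordered token sequence: at each step, the token encodes
# the residual needed to move the current estimate toward x0, so later tokens
# recursively refine earlier ones (coarse-to-fine), and the running estimate
# is a valid reconstruction at any timestep.
estimate = np.zeros_like(x0)
token_seq = []
for t in range(T):
    residual = (x0 - estimate) / (T - t)   # spread the correction over steps
    ids = quantize(residual)               # discrete token for this timestep
    token_seq.append(ids)
    estimate = estimate + dequantize(ids)  # partial reconstruction

print("token sequence length:", len(token_seq) * len(x0))
print("final reconstruction error:", float(np.abs(x0 - estimate).max()))
```

Because the final step quantizes the full remaining residual, the reconstruction error is bounded by half the codebook spacing; the flat, temporally ordered token sequence is exactly the kind of object an autoregressive LLM can consume.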