🤖 AI Summary
Multimodal large language models (MLLMs) suffer from inherent information loss in image understanding and generation, primarily due to distortions introduced by image discretization or diffusion-based denoising. To address this, we propose an end-to-end multimodal autoregressive probabilistic model that operates directly on continuous-valued image tokens, enabling lossless image representation. We design a lightweight, decoupled diffusion head that separates the generation pathway from the understanding pathway, so understanding is not forced through error-prone denoising steps, and we provide a theoretically grounded technique that guarantees numerically stable training. Additionally, we introduce dual-task balancing and CLIP-aligned optimization to jointly optimize both modalities. Crucially, our model is the first to unify understanding and generation under a single high-fidelity, continuous image representation. Experiments demonstrate state-of-the-art performance among joint multimodal models across 18 understanding benchmarks, matching methods built on a pretrained CLIP vision encoder, while simultaneously enabling high-quality image generation and exhibiting strong scalability.
📝 Abstract
Recent advancements in multi-modal large language models have propelled the development of joint probabilistic models capable of both image understanding and generation. However, we have identified that recent methods inevitably suffer from loss of image information during understanding tasks, due to either image discretization or diffusion denoising steps. To address this issue, we propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike the discretization line of methods, MMAR takes in continuous-valued image tokens to avoid information loss. Differing from diffusion-based approaches, we disentangle the diffusion process from the auto-regressive backbone by employing a light-weight diffusion head on top of each auto-regressed image patch embedding. In this way, when the model transitions from image generation to understanding through text generation, the backbone's hidden representation of the image is not limited to the last denoising step. To train our method successfully, we also propose a theoretically proven technique that addresses a numerical stability issue, together with a training strategy that balances the generation and understanding objectives. Through extensive evaluations on 18 image understanding benchmarks, MMAR demonstrates far superior performance over other joint multi-modal models, matching methods that employ a pretrained CLIP vision encoder, while simultaneously generating high-quality images. We also show that our method scales with larger data and model sizes.
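The decoupled diffusion head described above can be pictured as a small per-token denoiser conditioned on the backbone's hidden state, so the expensive iterative denoising never passes through the backbone itself. The sketch below is a hypothetical minimal PyTorch implementation of this idea; all class names, dimensions, and the cosine noise schedule are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class DiffusionHead(nn.Module):
    """Hypothetical light-weight diffusion head: a small MLP that predicts
    the noise added to one continuous image-token embedding, conditioned on
    the auto-regressive backbone's hidden state for that position."""

    def __init__(self, token_dim=16, hidden_dim=64, cond_dim=32):
        super().__init__()
        self.time_embed = nn.Linear(1, hidden_dim)  # embed scalar timestep
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t:  noisy continuous token, shape (B, token_dim)
        # t:    diffusion timestep in [0, 1], shape (B, 1)
        # cond: backbone hidden state for this position, shape (B, cond_dim)
        h = torch.cat([x_t, cond, self.time_embed(t)], dim=-1)
        return self.net(h)  # predicted noise, shape (B, token_dim)


def diffusion_loss(head, x0, cond):
    """Per-token denoising loss: sample a timestep, noise the clean token,
    and regress the head's output onto the injected noise (illustrative
    cosine interpolation schedule, an assumption for this sketch)."""
    t = torch.rand(x0.size(0), 1)
    noise = torch.randn_like(x0)
    alpha = torch.cos(t * torch.pi / 2)  # data weight at timestep t
    sigma = torch.sin(t * torch.pi / 2)  # noise weight at timestep t
    x_t = alpha * x0 + sigma * noise
    pred = head(x_t, t, cond)
    return ((pred - noise) ** 2).mean()
```

Because the head is conditioned on the backbone's per-patch hidden state rather than wrapped around the whole model, the same hidden state remains available in full fidelity when the sequence continues into text generation for understanding.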