MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

📅 2024-10-14
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from inherent information loss during image understanding and generation, primarily due to distortions introduced by image discretization or diffusion-based denoising. To address this, the paper proposes the first end-to-end multi-modal autoregressive probabilistic model that operates directly on continuous-valued image tokens, enabling lossless image representation. It employs a lightweight, decoupled diffusion head that disentangles the diffusion process from the autoregressive backbone, so the backbone's image representation is not tied to error-prone denoising steps, and introduces a theoretically grounded technique for numerically stable training. A training strategy balances the understanding and generation objectives. Crucially, the model unifies understanding and generation under a single high-fidelity, continuous image representation. Experiments across 18 understanding benchmarks show strong performance, matching methods built on a pretrained CLIP vision encoder, while simultaneously enabling high-quality image generation and exhibiting strong scalability.

📝 Abstract
Recent advancements in multi-modal large language models have propelled the development of joint probabilistic models capable of both image understanding and generation. However, we find that recent methods inevitably lose image information during the understanding task, due to either image discretization or diffusion denoising steps. To address this issue, we propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike the discretization line of methods, MMAR takes in continuous-valued image tokens to avoid information loss. Unlike diffusion-based approaches, we disentangle the diffusion process from the auto-regressive backbone by employing a light-weight diffusion head on top of each auto-regressed image patch embedding. In this way, when the model transitions from image generation to understanding through text generation, the backbone's hidden representation of the image is not limited to the last denoising step. To train our method successfully, we also propose a theoretically proven technique that addresses a numerical stability issue, together with a training strategy that balances the generation and understanding objectives. Through extensive evaluations on 18 image understanding benchmarks, MMAR demonstrates far superior performance to other joint multi-modal models, matching methods that employ a pretrained CLIP vision encoder, while also generating high-quality images. We further show that the method scales with larger data and model sizes.
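The core idea of the abstract, a light-weight diffusion head that predicts noise on a continuous-valued image token conditioned on the backbone's per-patch hidden state, can be sketched as follows. This is a minimal illustration under assumed shapes and a toy two-layer MLP head (`W1`, `W2`, `diffusion_head_loss` are hypothetical names, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_head_loss(z, x0, W1, W2, alpha_bar):
    """Epsilon-prediction MSE loss for a toy per-patch diffusion head.

    z         : (d_hidden,) AR backbone hidden state for one image patch (conditioning)
    x0        : (d_token,) continuous-valued image token for that patch (no discretization)
    W1, W2    : weights of a hypothetical 2-layer MLP head
    alpha_bar : cumulative noise-schedule value for one sampled timestep
    """
    eps = rng.standard_normal(x0.shape)                           # Gaussian noise
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps # noised token
    h = np.tanh(W1 @ np.concatenate([xt, z]))                     # condition on backbone state
    eps_hat = W2 @ h                                              # head predicts the noise
    return float(np.mean((eps_hat - eps) ** 2))                   # epsilon-prediction MSE

# Toy dimensions: 8-dim hidden state, 4-dim continuous image token.
d_hidden, d_token, d_mid = 8, 4, 16
W1 = rng.standard_normal((d_mid, d_token + d_hidden)) * 0.1
W2 = rng.standard_normal((d_token, d_mid)) * 0.1
loss = diffusion_head_loss(rng.standard_normal(d_hidden),
                           rng.standard_normal(d_token),
                           W1, W2, alpha_bar=0.5)
```

Because only the small head runs the diffusion objective, the backbone's hidden state `z` stays a full continuous representation of the patch, usable for understanding without being tied to any denoising step.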
Problem

Research questions and friction points this paper is trying to address.

Image information is inevitably lost during multi-modal understanding, due to either image discretization or diffusion denoising steps
Discretized image tokens cannot represent images losslessly
In diffusion-based models, the backbone's image representation is tied to the last denoising step
Innovation

Methods, ideas, or system contributions that make the work stand out.

Takes in continuous-valued image tokens to avoid information loss
Employs a light-weight diffusion head on top of each auto-regressed image patch embedding
Balances the generation and understanding objectives, with a theoretically proven numerical-stability technique
Jian Yang
University of Science and Technology of China, WeChat, Tencent Inc.
Dacheng Yin
University of Science and Technology of China
speech enhancement · representation learning · speech editing
Yizhou Zhou
WeChat, Tencent Inc.
Fengyun Rao
WeChat, Tencent Inc.
Wei Zhai
University of Science and Technology of China
Yang Cao
University of Science and Technology of China, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Zhengjun Zha
University of Science and Technology of China, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center