STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of unifying understanding and generation capabilities in multimodal large language models—where joint optimization often induces conflicts and performance trade-offs—this paper proposes a staged (understanding → generation → editing) stacked autoregressive (AR) architecture. Building upon a frozen base AR model, the method progressively stacks isomorphic AR modules to decouple tasks and extend capabilities. It introduces a high-capacity vector quantizer (VQ) to enhance the granularity of image representations and an implicit reasoning mechanism to improve generation robustness under complex conditions. A multi-stage frozen fine-tuning strategy preserves existing understanding abilities while new generative competencies are acquired. The method achieves state-of-the-art results on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), significantly advancing unified multimodal modeling.

📝 Abstract
Multimodal large language models (MLLMs) play a pivotal role in the quest for general artificial intelligence. However, achieving a unified objective for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the base autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity VQ to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.
Problem

Research questions and friction points that this paper aims to address.

Achieving unified multimodal understanding and generation
Avoiding cross-task interference in multimodal learning
Enhancing image representation granularity and generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stacked autoregressive modules for progressive multimodal learning stages
High-capacity VQ for enhanced image representation granularity
Implicit reasoning mechanism to improve generation quality under complex conditions
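The staged stacking idea above—freeze everything learned so far, then add a new isomorphic AR module for the next task—can be illustrated with a minimal sketch. This is a hypothetical structural illustration only: the class names (`ARModule`, `StackedAR`) and the stage names are assumptions for exposition, not APIs from the paper.

```python
# Minimal sketch of multi-stage frozen fine-tuning with stacked AR modules.
# Assumption: each stage freezes the base model and all previously stacked
# modules, so only the newest module receives gradient updates.
from dataclasses import dataclass, field

@dataclass
class ARModule:
    name: str
    trainable: bool = True

    def freeze(self) -> None:
        self.trainable = False

@dataclass
class StackedAR:
    base: ARModule                     # frozen foundation AR model
    stack: list = field(default_factory=list)

    def add_stage(self, name: str) -> ARModule:
        # Freeze the base and every previously trained module before
        # stacking a new trainable one on top (avoids cross-task interference).
        self.base.freeze()
        for module in self.stack:
            module.freeze()
        new_module = ARModule(name)
        self.stack.append(new_module)
        return new_module

    def trainable_modules(self) -> list:
        return [m.name for m in [self.base, *self.stack] if m.trainable]

model = StackedAR(base=ARModule("understanding_base"))
model.add_stage("generation")   # stage 2: only the generation module trains
model.add_stage("editing")      # stage 3: only the editing module trains
```

In a real implementation the freeze step would correspond to disabling gradient updates on the earlier modules' parameters; this sketch only captures the bookkeeping of which stage is trainable at each step.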