🤖 AI Summary
To address the challenge of unifying understanding and generation in multimodal large language models, where joint optimization often induces conflicts and performance trade-offs, this paper proposes a staged (understanding → generation → editing) stacked autoregressive (AR) architecture. Building on a frozen base AR model, the method progressively stacks isomorphic AR modules to decouple tasks and extend capabilities. It introduces high-capacity vector quantization (VQ) to enhance the granularity of image representations and an implicit reasoning mechanism to improve generation robustness under complex conditions. A multi-stage frozen fine-tuning strategy preserves existing understanding abilities while new generative competencies are acquired. The method achieves state-of-the-art results on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), significantly advancing unified multimodal modeling.
📝 Abstract
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified objective for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the base autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce high-capacity vector quantization (VQ) to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.
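The staged freeze-and-stack idea can be sketched in a few lines. The following is a minimal, dependency-free illustration, not the paper's implementation: class and stage names are hypothetical stand-ins, and each `ARModule` abstracts away all actual model weights.

```python
class ARModule:
    """Stand-in for one isomorphic autoregressive block (weights omitted)."""
    def __init__(self, name):
        self.name = name
        self.trainable = True  # a newly stacked module starts trainable

    def freeze(self):
        self.trainable = False


class StackedAR:
    """Progressively stacks AR modules on top of a frozen base model."""
    def __init__(self, base):
        base.freeze()          # the base (understanding) model is never updated
        self.stages = [base]

    def add_stage(self, module):
        # Freeze everything learned so far, then stack the new module,
        # so each training stage only updates its own parameters.
        for m in self.stages:
            m.freeze()
        self.stages.append(module)

    def trainable_stages(self):
        return [m.name for m in self.stages if m.trainable]


# Staged curriculum: understanding -> generation -> editing.
model = StackedAR(ARModule("understanding"))
model.add_stage(ARModule("generation"))  # only "generation" trains here
model.add_stage(ARModule("editing"))     # now only "editing" trains
```

After the final stage is stacked, only the editing module remains trainable, which is how the scheme sidesteps cross-task interference: earlier capabilities cannot be overwritten because their parameters never receive gradients.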