🤖 AI Summary
This work addresses the challenge of unifying visual understanding and generation within a shared representation space, which is hindered by conflicting requirements between decoding mechanisms and representational objectives. To resolve this, the authors propose a decoupled architecture that separates semantic modeling from detail modeling. A unified vision tokenizer compresses image tokens by 4× for efficient semantic modeling; an LLM-based Transformer unifies autoregressive text decoding with diffusion-based image decoding; and a cascaded head injects semantically gated detail residuals to restore high-frequency content. The approach reduces training cost to only 20% of that of the Tar-1.5B model while outperforming it on benchmarks such as GenEval and MMBench, and its token compression enables efficient high-resolution image processing, demonstrating both scalability and performance gains.
📝 Abstract
A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to optimize them jointly within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced unified multimodal models (UMMs) in both visual understanding and generation. Cheers also achieves 4× token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms Tar-1.5B on the popular GenEval and MMBench benchmarks while requiring only 20% of its training cost, indicating effective and efficient unified multimodal modeling. We will release all code and data for future research.
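To make the gated detail residual idea concrete, below is a minimal NumPy sketch of one plausible form of the mechanism: a gate computed from the decoded semantic features modulates how much of the tokenizer's detail signal is injected back. The abstract does not specify the exact parameterization, so the weight matrix `W_gate`, the sigmoid gating, and all shapes here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 8  # feature dimension (illustrative)
W_gate = rng.normal(size=(d, d)) * 0.1  # hypothetical learned gate weights

def gated_detail_residual(semantic, detail):
    """Refine decoded semantics with a semantically gated detail residual.

    The gate is a function of the semantic features, so detail injection
    is suppressed where semantics alone suffice and opened up where
    high-frequency content is needed (an assumed design, per the sketch).
    """
    gate = sigmoid(semantic @ W_gate)  # elementwise gate in (0, 1)
    return semantic + gate * detail    # residual injection of detail tokens

semantic = rng.normal(size=(4, d))  # decoded semantic features (4 tokens)
detail = rng.normal(size=(4, d))    # detail tokens from the vision tokenizer
out = gated_detail_residual(semantic, detail)
print(out.shape)  # (4, 8)
```

Because the gate lies in (0, 1), the refined output never moves farther from the semantic prediction than the full detail residual would allow, which is one way such a design can keep semantics stable while still recovering fine detail.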