Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

📅 2026-03-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of unifying visual understanding and generation within a shared representation space, where the two tasks impose conflicting requirements on decoding mechanisms and representational objectives. To resolve this, the authors propose a decoupled architecture that separates semantic modeling from detail modeling. First, a unified vision tokenizer compresses visual tokens by 4× to enable efficient semantic modeling; then an LLM-based hybrid Transformer decoder (autoregressive for text, diffusion for images) and a gated detail residual mechanism jointly restore high-frequency content. The approach cuts training cost to only 20% of that required by Tar-1.5B while outperforming it on benchmarks such as GenEval and MMBench. It also enables efficient high-resolution image processing, demonstrating both scalability and performance gains.
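The summary's "compresses visual tokens by 4×" can be illustrated with a common patch-merging scheme (the paper's actual tokenizer design may differ): each 2×2 neighborhood of patch embeddings is folded into a single, wider token, quartering the sequence length the LLM must attend over. A minimal sketch, with all names hypothetical:

```python
import numpy as np

def compress_tokens_4x(patch_tokens, grid_h, grid_w):
    """Merge each 2x2 neighborhood of patch tokens into one token.

    One common way to achieve 4x token compression: view the
    (grid_h * grid_w, d) token sequence as a 2-D patch grid, then
    concatenate each 2x2 block's four embeddings into a single
    token of width 4*d. Sequence length shrinks by 4x.
    """
    d = patch_tokens.shape[-1]
    grid = patch_tokens.reshape(grid_h, grid_w, d)
    # Split the grid into 2x2 blocks: (grid_h//2, 2, grid_w//2, 2, d)
    blocks = grid.reshape(grid_h // 2, 2, grid_w // 2, 2, d)
    # Bring block indices to the front, then flatten each block's
    # four d-dim embeddings into one 4*d-dim token.
    merged = blocks.transpose(0, 2, 1, 3, 4).reshape(-1, 4 * d)
    return merged

tokens = np.random.randn(16 * 16, 64)   # 256 patch tokens, dim 64
merged = compress_tokens_4x(tokens, 16, 16)
print(merged.shape)                      # (64, 256): 4x fewer, 4x wider
```

In practice a learned projection would map the 4*d concatenation back down to the LLM's hidden size; the reshape above only shows where the 4× factor comes from.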

📝 Abstract
A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making them non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4× token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms Tar-1.5B on the popular benchmarks GenEval and MMBench while requiring only 20% of the training cost, indicating effective and efficient unified multimodal modeling with 4× token compression. We will release all code and data for future research.
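Component (iii), the "semantically gated detail residuals," can be sketched as follows. This is a hypothetical reading of the abstract, not the paper's implementation: the coarse output decoded from semantic tokens predicts a gate in (0, 1), which scales how much high-frequency detail from the tokenizer is added back. All function and parameter names below are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_detail_refine(semantic_decode, detail_residual, W_gate, b_gate):
    """Inject detail residuals, gated by the semantic decode.

    The semantic decode predicts a per-channel gate in (0, 1); the
    gate scales the high-frequency residual before it is added back,
    so details are injected only where the semantics call for them.
    """
    gate = sigmoid(semantic_decode @ W_gate + b_gate)
    return semantic_decode + gate * detail_residual

rng = np.random.default_rng(0)
d = 8
coarse = rng.normal(size=(4, d))      # output decoded from semantic tokens
detail = rng.normal(size=(4, d))      # high-frequency residual from tokenizer
W, b = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
refined = gated_detail_refine(coarse, detail, W, b)
print(refined.shape)  # (4, 8)
```

Because the gate lies strictly in (0, 1), the refined output never moves away from the coarse decode by more than the residual itself, which is one way such a design can keep semantics stable while restoring detail.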
Problem

Research questions and friction points this paper is trying to address.

multimodal modeling
visual comprehension
image generation
unified model
representation mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

decoupled representation
unified multimodal modeling
semantic-token compression
gated detail residuals
cascaded flow matching
Yichen Zhang
Tsinghua University
Da Peng
Xi’an Jiaotong University
Zonghao Guo
University of Chinese Academy of Sciences
Zijian Zhang
University of Chinese Academy of Sciences
Xuesong Yang
NVIDIA
Machine Learning, Deep Learning, Natural Language Processing, Speech Signal Processing
Tong Sun
University of Chinese Academy of Sciences
Shichu Sun
University of Chinese Academy of Sciences
Yidan Zhang
PhD Student, The Chinese University of Hong Kong, Shenzhen
computer vision, deep learning
Yanghao Li
Apple
Computer Vision
Haiyan Zhao
Peking University
Wang Xu
Harbin Institute of Technology
natural language processing, artificial intelligence
Qi Shi
Tsinghua University
Yangang Sun
Tsinghua University
Chi Chen
Tsinghua University
Multimodal Learning, Natural Language Processing, Machine Learning
Shuo Wang
Department of Engineering Physics, Tsinghua University
Computer Graphics, Medical Image Analysis
Yukun Yan
Tsinghua University
Large Language Model
Xu Han
Research Assistant Professor, Tsinghua University
Natural Language Processing, Large Language Model, Knowledge Graph, Information Extraction
Qiang Ma
Assistant Researcher, Tsinghua University
wireless sensor networks, network diagnosis
Wei Ke
Xi'an Jiaotong University
Computer Vision and Deep Learning
Liang Wang
Institute of Psychology, Chinese Academy of Sciences
ECoG, fMRI, Neuronal oscillations, Brain networks, Spatial attention
Zhiyuan Liu
Tsinghua University
autonomous driving, traffic simulation
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language Processing, Artificial Intelligence, Social Computing