Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

📅 2026-04-15

🤖 AI Summary
This work addresses the gap between comprehension and generation in unified multimodal models, whose generative performance often fails to exploit the model's strong inherent understanding. To bridge this gap, we propose UniRect-CoT, a framework that, for the first time, uses the model's own comprehension as a self-supervised signal to iteratively refine its outputs through a Chain-of-Thought (CoT) mechanism, without any additional training. By treating the diffusion denoising process as intrinsic visual reasoning and correcting intermediate results so that they align with the instruction, UniRect-CoT achieves "zero-cost" performance gains. The framework plugs into existing unified multimodal architectures and consistently improves output quality across diverse and complex generation tasks.

📝 Abstract
Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch: their understanding capability significantly exceeds their generation capability. This mismatch indicates that the model's rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human "Thinking-While-Drawing" paradigm, in which humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the "free lunch" hidden in the UMM's powerful inherent understanding, using it to continuously reflect, activate internal knowledge, and rectify intermediate results during generation. We regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction as understood by the model; this alignment serves as a self-supervisory signal to rectify UMM generation. Extensive experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.
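
The abstract does not spell out the update rule, so the following is only a toy sketch of the general idea of rectifying intermediate denoising results with an alignment signal; it is not the authors' method. Every name here (`denoise_step`, `alignment_grad`, `generate`, `prior_mode`, `target`) is hypothetical. The "understanding" signal is assumed to be a simple squared distance to a known target vector, and the "denoiser" is assumed to be a linear pull toward a biased mode.

```python
import math
import random

def denoise_step(x, prior_mode, rate=0.3):
    # Toy stand-in for one denoising step (assumption, not a real UMM):
    # pull the latent toward what the generator would produce on its own.
    return [xi + rate * (pi - xi) for xi, pi in zip(x, prior_mode)]

def alignment_grad(x, target):
    # Gradient of the toy alignment score -||x - target||^2 with respect to x.
    # In a real system this signal would come from the model's understanding
    # of the instruction; here it is a plain distance (assumption).
    return [-2.0 * (xi - ti) for xi, ti in zip(x, target)]

def distance(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def generate(target, prior_mode, num_steps=20, rectify=True,
             step_size=0.1, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from noise
    for _ in range(num_steps):
        x = denoise_step(x, prior_mode)
        if rectify:
            # Rectification: nudge the intermediate result toward the target.
            grad = alignment_grad(x, target)
            x = [xi + step_size * gi for xi, gi in zip(x, grad)]
    return x

target = [1.0, 1.0, 1.0, 1.0]      # stand-in for the understood instruction
prior_mode = [0.0, 0.0, 0.0, 0.0]  # where unguided generation drifts
plain = generate(target, prior_mode, rectify=False)
rectified = generate(target, prior_mode, rectify=True)
err_plain = distance(plain, target)
err_rectified = distance(rectified, target)
print(err_plain, err_rectified)
```

In this toy setting the rectified run ends closer to the target than the unrectified one, because each step blends the denoiser's pull with a pull toward the target. No parameters are trained, which mirrors the training-free claim only in spirit.
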
Problem

Research questions and friction points this paper is trying to address.

Unified Multimodal Models
capability mismatch
generation quality
inherent understanding
knowledge activation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Multimodal Models
Reflective Rectification
Chain-of-Thought
Diffusion Denoising
Self-supervision