Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

📅 2026-01-29
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work proposes UniMRG, a unified multimodal framework that addresses the limited synergy between visual understanding and generation in existing models. By integrating auxiliary generative tasks (pixel reconstruction, depth estimation, and semantic segmentation) within a single architecture, UniMRG enables bidirectional enhancement between comprehension and synthesis. The method is an architecture-agnostic post-training strategy that leverages multi-task generation to improve visual understanding capabilities. Experimental results show that UniMRG notably advances fine-grained perception, spatial relationship modeling, and hallucination suppression, while simultaneously enhancing generation quality. These findings validate the proposed understanding-generation co-evolution mechanism within a unified model.

📝 Abstract
Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
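The training objective described in the abstract, standard visual-understanding objectives plus auxiliary generation losses over pixel, depth, and segmentation representations, can be sketched as a weighted multi-task loss. This is a minimal illustration only: the function name, the single shared auxiliary weight, and the loss keys are assumptions, since the paper's exact weighting scheme is not given here.

```python
# Hypothetical sketch of UniMRG-style multi-task post-training loss:
# the understanding loss plus auxiliary multi-representation
# generation losses. Names and the shared weight are illustrative,
# not taken from the paper.

def unimrg_loss(understanding_loss: float,
                aux_losses: dict[str, float],
                aux_weight: float = 0.5) -> float:
    """Combine the standard understanding loss with auxiliary
    generation losses for pixel (reconstruction), depth (geometry),
    and segmentation (structure)."""
    expected = {"pixel", "depth", "segmentation"}
    missing = expected - aux_losses.keys()
    if missing:
        raise ValueError(f"missing auxiliary losses: {sorted(missing)}")
    return understanding_loss + aux_weight * sum(
        aux_losses[k] for k in expected)

# Example: three auxiliary terms scaled by a shared weight and added
# to the base understanding objective.
total = unimrg_loss(2.0, {"pixel": 0.4, "depth": 0.3, "segmentation": 0.3})
```

In practice each auxiliary task would likely carry its own weight and the losses would be tensors produced by the model's generation heads; the scalar form here only shows how the auxiliary terms augment, rather than replace, the understanding objective.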
Problem

Research questions and friction points this paper is trying to address.

Unified Multimodal Models
visual understanding
generation
multi-representation
post-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Multimodal Models
Multi-Representation Generation
Visual Understanding
Post-Training
Auxiliary Generation Tasks