UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Although unified multimodal models excel at cross-modal understanding, they struggle to translate this capability into high-quality, controllable generation, a limitation the paper terms “Conduction Aphasia.” To address this, the work proposes UniCorn, a framework that, for the first time, enhances generative capacity within a unified model through fully self-supervised learning. UniCorn partitions the model into three roles (Proposer, Solver, and Judge) to establish a self-play mechanism that generates high-quality interaction data. It further introduces cognitive pattern reconstruction to explicitly convert implicit understanding into actionable generative signals. The method achieves state-of-the-art performance across six image generation benchmarks, setting new records on TIIF, DPG, CompBench, and UniCycle, while improving scores by 5.0 and 6.5 points on WISE and OneIG, respectively.
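The summary describes the three-role self-play only at a high level. As a minimal Python sketch, assuming a hypothetical `model` object whose `propose`, `solve`, and `judge` methods stand in for the three roles (none of these names come from the paper), one round of self-generated supervision might look like this:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Interaction:
    prompt: str   # instruction drafted by the Proposer role
    image: Any    # synthesis produced by the Solver role
    score: float  # quality rating assigned by the Judge role

def self_play_round(model, n_samples: int, keep_threshold: float) -> List[Interaction]:
    """One hypothetical round of self-generated supervision: a single UMM
    plays all three roles, and only interactions the Judge rates above a
    threshold are kept as training data."""
    kept: List[Interaction] = []
    for _ in range(n_samples):
        prompt = model.propose()            # Proposer: draft a T2I instruction
        image = model.solve(prompt)         # Solver: synthesize the image
        score = model.judge(prompt, image)  # Judge: rate prompt-image fidelity
        if score >= keep_threshold:
            kept.append(Interaction(prompt, image, score))
    return kept  # the retained pairs supervise the next fine-tuning step
```

The key design point is that all three roles share one set of weights, so the filtered interactions improve the model without any external data or teacher.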

📝 Abstract
While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle, while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
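UniCycle's Text-to-Image-to-Text loop can likewise be pictured with a short sketch. The `generate_image` and `describe_image` calls and the pluggable `similarity` metric below are illustrative assumptions; the abstract does not specify the benchmark's actual scoring protocol:

```python
from typing import Callable, List

def unicycle_score(model, prompts: List[str],
                   similarity: Callable[[str, str], float]) -> float:
    """Hypothetical cycle-consistency score: generate an image from each
    prompt, caption it, and measure how well the caption recovers the
    original text."""
    scores = []
    for prompt in prompts:
        image = model.generate_image(prompt)        # T2I leg of the cycle
        caption = model.describe_image(image)       # I2T leg of the cycle
        scores.append(similarity(prompt, caption))  # e.g., embedding cosine
    return sum(scores) / len(scores)
```

A high score requires both legs to work, faithful generation and accurate comprehension, which is why the loop serves as a test of whether multimodal coherence has been restored.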
Problem

Research questions and friction points this paper is trying to address.

Unified Multimodal Models
Multimodal Generation
Conduction Aphasia
Cross-modal Comprehension
Generative Coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Improvement
Unified Multimodal Models
Self-Generated Supervision
Cognitive Pattern Reconstruction
Cycle-Consistency Benchmark