UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

📅 2025-05-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of unifying image understanding and generation in multimodal large language models (MLLMs), where existing approaches suffer from suboptimal generation quality and poor semantic alignment between text and images. We propose UniGen, a unified architecture with an end-to-end optimization framework. Methodologically, we employ a full training pipeline of multi-stage pre-training, supervised fine-tuning, and direct preference optimization (DPO); introduce Chain-of-Thought Verification (CoT-V), a novel test-time mechanism in which the model performs autonomous, stepwise post-hoc evaluation of cross-modal alignment; and integrate Best-of-N sampling to improve inference quality. Our contributions include: (1) the first unified modeling and end-to-end optimization of understanding and generation trained entirely on open-source data; (2) new state-of-the-art results, with a GenEval score of 0.78 and a DPG-Bench score of 85.19; and (3) ablation studies that systematically identify key bottlenecks and optimization pathways in unified multimodal modeling.

๐Ÿ“ Abstract
We introduce UniGen, a unified multimodal large language model (MLLM) capable of both image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. More importantly, we propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly boosts UniGen's image generation quality using a simple Best-of-N test-time strategy. Specifically, CoT-V enables UniGen to act as both image generator and verifier at test time, assessing the semantic alignment between a text prompt and its generated image in a step-by-step CoT manner. Trained entirely on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks, with a final score of 0.78 on GenEval and 85.19 on DPG-Bench. Through extensive ablation studies, our work provides actionable insights and addresses key challenges in the full life cycle of building unified MLLMs, contributing meaningful directions for future research.
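The CoT-V test-time strategy described above amounts to Best-of-N selection scored by the model itself: generate N candidate images, have the model verify each prompt–image pair step by step, and keep the highest-scoring candidate. A minimal sketch of that selection loop, with `generate_image` and `cot_verify` as hypothetical stand-ins for UniGen's actual generator and verifier (not the paper's implementation):

```python
import random

def generate_image(prompt: str, seed: int) -> dict:
    # Hypothetical stand-in for UniGen's image generator: returns a
    # candidate tagged with the sampling seed that produced it.
    return {"prompt": prompt, "seed": seed}

def cot_verify(prompt: str, image: dict) -> float:
    # Hypothetical stand-in for CoT-V: the real verifier would decompose
    # the prompt into steps (objects, attributes, relations), check each
    # against the image, and aggregate into an alignment score in [0, 1].
    # Here we return a deterministic pseudo-score for illustration.
    rng = random.Random(image["seed"])
    return rng.random()

def best_of_n(prompt: str, n: int = 4) -> tuple[dict, float]:
    """Generate n candidates and return the one the verifier scores highest."""
    candidates = [generate_image(prompt, seed=i) for i in range(n)]
    scores = [cot_verify(prompt, img) for img in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

image, score = best_of_n("a red cube on top of a blue sphere", n=4)
```

The key design point is that the same MLLM plays both roles, so no external reward model is needed; the cost is N generation passes plus N verification passes per prompt.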
Problem

Research questions and friction points this paper is trying to address.

Unified multimodal model for image understanding and generation
Enhancing image generation via Chain-of-Thought Verification
Optimizing training pipeline for multimodal large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal training pipeline
Chain-of-Thought Verification strategy
Open-source dataset training