MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing unified generative models struggle with complex multimodal generation tasks involving intertwined multi-condition constraints. To address this, we propose the first unified generation framework grounded in Multimodal Chain-of-Thought (MCoT), which enables precise cross-modal semantic coordination via stepwise reasoning and element-level text-image disentangled alignment. Methodologically, we introduce a native MCoT training paradigm and a Mixture of Transformer Experts (MTXpert) architecture featuring expert parallelism, seamlessly integrating natural language generation (NLG) and visual synthesis capabilities without inducing modality conflicts. Additionally, we incorporate a self-reflective multimodal reasoning mechanism to enhance logical consistency. Our approach achieves significant improvements over state-of-the-art methods across multiple text-to-image and image-to-text benchmarks. Generated images exhibit marked gains in detail fidelity, structural coherence, and condition adherence.
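
The stepwise reasoning and self-reflection described in the summary can be pictured as a plan, draft, reflect, refine loop around a single unified model. The Python sketch below only illustrates that control flow under assumed interfaces; the model methods (`reason`, `generate_image`) and the stopping heuristic are hypothetical placeholders, not the paper's API.

```python
# Minimal sketch of a multimodal chain-of-thought (MCoT) generation loop.
# The model interface (reason, generate_image) is an assumed placeholder,
# not the interface described in the paper.

def mcot_generate(model, prompt: str, max_rounds: int = 3):
    # Step 1: decompose the prompt into elements (objects, attributes,
    # spatial relations) and a rough layout, expressed in text.
    plan = model.reason(f"Decompose into elements and layout: {prompt}")

    # Step 2: draft an image conditioned on both the prompt and the plan.
    image = model.generate_image(prompt, plan)

    for _ in range(max_rounds):
        # Step 3: self-reflection -- compare the drafted image against the
        # planned elements and list missing or inconsistent ones.
        critique = model.reason(
            f"Check this image against the plan: {plan}. "
            "List missing or inconsistent elements.",
            image=image,
        )
        if "no issues" in critique.lower():
            break
        # Step 4: refine, using the critique as an additional condition.
        image = model.generate_image(prompt, plan, feedback=critique)

    return image
```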

📝 Abstract
Unified generative models have demonstrated extraordinary performance in both text and image generation. However, they tend to underperform when generating intricate images with various interwoven conditions, a setting that is difficult to handle by relying solely on straightforward text-to-image generation. In response to this challenge, we introduce MINT, the first unified generative model empowered with a native multimodal chain of thought (MCoT) for enhanced image generation. First, we design Mixture of Transformer Experts (MTXpert), an expert-parallel structure that effectively supports both natural language generation (NLG) and visual capabilities while avoiding the modality conflicts that could hinder the full potential of each modality. Building on this, we propose an innovative MCoT training paradigm, a step-by-step approach to multimodal thinking, reasoning, and reflection specifically designed to enhance image generation. This paradigm equips MINT with nuanced, element-wise decoupled alignment and a comprehensive understanding of textual and visual components. Furthermore, it fosters advanced multimodal reasoning and self-reflection, enabling the construction of images that are firmly grounded in the logical relationships between these elements. Notably, MINT achieves superior performance across multiple text-to-image (T2I) and image-to-text (I2T) benchmarks.
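
As a minimal sketch of how an expert-parallel block could keep the two modalities separate while still letting them attend to each other, the PyTorch code below routes text tokens and image tokens through separate feed-forward experts after a shared self-attention layer. This is an assumed reading of the MTXpert description in the abstract; the module names, the single-expert-per-modality choice, and the boolean routing mask are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MTXpertBlock(nn.Module):
    """Sketch of a mixture-of-Transformer-experts block.

    Self-attention is shared across the joint text+image sequence, while the
    feed-forward path is split into modality-specific experts. This is an
    assumed interpretation of the paper's description, not its actual code.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality to avoid parameter interference.
        self.text_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.image_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_image: (batch, seq) boolean modality mask.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # shared attention couples modalities
        x = x + attn_out
        h = self.norm2(x)
        # Route each token to the expert matching its modality.
        ffn_out = torch.where(
            is_image.unsqueeze(-1),
            self.image_expert(h),
            self.text_expert(h),
        )
        return x + ffn_out
```

In a full model, a stack of such blocks would presumably feed a language head over text positions and an image decoder over image positions.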
Problem

Research questions and friction points this paper is trying to address.

Enhancing image generation with multimodal chain-of-thought reasoning.
Underperformance of unified models on intricate, multi-condition image generation tasks.
Weak element-wise alignment and understanding between textual and visual components.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Transformer Experts (MTXpert) for multimodal support
Multimodal Chain of Thought (MCoT) training paradigm
Enhanced image generation with element-wise alignment
🔎 Similar Papers
No similar papers found.
👥 Authors
Yi Wang, Zhejiang University
Mushui Liu, Zhejiang University (Generative Models, Multi-modal Learning, Few-shot Learning)
Wanggui He, Researcher, Alibaba Group (AI)
Longxiang Zhang, Alibaba Group
Ziwei Huang, Zhejiang University (Multimodal LLMs, AIGC)
Guanghao Zhang, Alibaba Group
Fangxun Shu, Bytedance (Multimodal)
Zhong Tao, Alibaba Group
Dong She, University of Science and Technology of China (Computer Vision)
Zhelun Yu, Alibaba Group
Haoyuan Li, Alibaba Group
Weilong Dai, Alibaba Group
Mingli Song, Zhejiang University
Jie Song, Zhejiang University
Hao Jiang, Alibaba Group