Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in multimodal chain-of-thought (CoT) reasoning: difficulty in modeling visual state transitions and lack of coherence in reasoning trajectories. To this end, we propose a unified vision-language CoT reasoning framework. Methodologically, we introduce a two-level reasoning paradigm: macro-level CoT for high-level task planning and micro-level CoT for fine-grained vision-language subtask execution. We further incorporate image-text interleaved supervision and multi-task joint training to enable coherent cross-modal state modeling. Built upon a unified vision-language understanding-generation architecture, our approach employs a two-stage structured training strategy. Evaluated on multiple benchmarks—including WISE, RISE, and KRIS—our method achieves state-of-the-art performance. All experiments are conducted efficiently using only eight A100 GPUs. The framework significantly improves reasoning coherence, generalization capability, and training efficiency.

📝 Abstract
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to their limited capacity to model visual state transitions, or due to incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve this is non-trivial, given the high computational cost and training burden. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a macro-level CoT for high-level task planning and a micro-level CoT for subtask execution. This design significantly reduces computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for the macro-level CoT with multi-task objectives for the micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multimodal reasoning. Moreover, thanks to this design, all experiments can be completed efficiently using only 8 A100 GPUs with 80 GB of VRAM each. Experimental results on a reasoning-driven image generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicate that Uni-CoT achieves SOTA performance and strong generalization, establishing it as a promising solution for multimodal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/
Problem

Research questions and friction points this paper is trying to address.

Extending Chain-of-Thought reasoning to vision-language tasks
Modeling coherent visual state transitions for multimodal reasoning
Reducing computational cost in unified vision-language reasoning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Chain-of-Thought framework for multimodal reasoning
Two-level reasoning paradigm for task planning and execution
Structured training with interleaved image-text supervision
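The two-level paradigm above can be pictured as a planner-executor loop: the macro-level CoT decomposes the task into subtasks, and the micro-level CoT executes each subtask while carrying the evolving visual-textual state forward. The sketch below is purely illustrative, since the paper's actual interfaces are not given here; all function names, the fixed three-step plan, and the state dictionary are hypothetical placeholders, not the authors' API.

```python
# Hypothetical sketch of a two-level (macro/micro) CoT loop, inferred from the
# abstract's description. Names and the fixed plan are illustrative only.

def macro_cot_plan(task: str) -> list[str]:
    # Macro-level CoT: decompose the task into high-level subtasks.
    # A real system would prompt the unified model; here we return a fixed plan.
    return [f"understand: {task}", f"generate: {task}", f"verify: {task}"]

def micro_cot_execute(subtask: str, state: dict) -> dict:
    # Micro-level CoT: execute one vision-language subtask and return the
    # updated state (e.g., an intermediate image plus its textual rationale).
    state = dict(state)  # keep the previous state immutable
    state["trace"] = state.get("trace", []) + [subtask]
    return state

def uni_cot(task: str) -> dict:
    state: dict = {}
    for subtask in macro_cot_plan(task):          # plan once at the macro level
        state = micro_cot_execute(subtask, state)  # execute at the micro level
    return state

result = uni_cot("draw a red cube on a table")
print(len(result["trace"]))  # → 3
```

Keeping planning and execution in separate levels is what the abstract credits for the reduced computational overhead: each micro-level step only attends to its own subtask and the current state, rather than the full reasoning trajectory.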
Luozheng Qin
Shanghai Academy of AI for Science
generative model, text-to-image generation, neck-choking technology
Jia Gong
Shanghai Academy of AI for Science
Yuqing Sun
Shanghai Academy of AI for Science
Tianjiao Li
Nanyang Technological University
Mengping Yang
East China University of Science and Technology
Few-shot Learning, Generative Models
Xiaomeng Yang
Shanghai Academy of AI for Science
Chao Qu
INFTech
Zhiyu Tan
Shanghai Academy of AI for Science, Fudan University
Hao Li
Shanghai Academy of AI for Science, Fudan University