I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

📅 2025-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image diffusion models lack multimodal contextual reasoning capabilities, constrained by pixel-level fine-tuning paradigms and the scarcity of high-quality reasoning-annotated data. To address this, we propose ThinkDiff, the first framework that employs a large language model (LLM) decoder as a proxy task to align the feature spaces of vision-language models (VLMs) and diffusion decoders, enabling logical reasoning and cross-modal compositional generation without requiring reasoning annotations. Our approach relies solely on lightweight cross-modal proxy training, avoiding complex fine-tuning or large-scale reasoning datasets. On the CoBSAT benchmark, ThinkDiff boosts reasoning accuracy from 19.2% to 46.3%, achieving this with only four A100 GPUs in five hours. Moreover, it significantly enhances the fidelity and logical consistency of images generated by jointly composing multiple images and texts.

📝 Abstract
This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the LLM decoder shares the same input feature space with diffusion decoders that use the corresponding LLM encoder for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.
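The proxy-task idea in the abstract can be sketched roughly as follows: train only a lightweight aligner that maps VLM features into the LLM encoder's feature space, supervised by a frozen LLM decoder predicting text tokens; since the diffusion decoder consumes that same feature space, the aligned features transfer at inference. This is a minimal PyTorch illustration with dummy dimensions, random data, and stand-in modules (`Aligner`, `llm_decoder`, the feature sizes) that are all hypothetical, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the real method uses pretrained VLM and
# encoder-decoder LLM components, not these toy sizes.
VLM_DIM, LLM_DIM, VOCAB = 64, 48, 100

class Aligner(nn.Module):
    """Trainable network mapping VLM token features into the LLM-encoder
    feature space shared by the LLM decoder and the diffusion decoder."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(VLM_DIM, LLM_DIM), nn.GELU(),
            nn.Linear(LLM_DIM, LLM_DIM),
        )

    def forward(self, x):
        return self.net(x)

# Frozen stand-in for the pretrained LLM decoder (proxy supervision target).
llm_decoder = nn.Linear(LLM_DIM, VOCAB)
for p in llm_decoder.parameters():
    p.requires_grad_(False)

aligner = Aligner()
opt = torch.optim.Adam(aligner.parameters(), lr=1e-3)

# One proxy-training step: only the aligner receives gradients.
vlm_feats = torch.randn(2, 8, VLM_DIM)        # dummy VLM features (batch, tokens, dim)
target_ids = torch.randint(0, VOCAB, (2, 8))  # dummy caption token ids
logits = llm_decoder(aligner(vlm_feats))
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), target_ids.reshape(-1)
)
loss.backward()
opt.step()

# At inference, the aligned features would instead be fed to a diffusion
# decoder that shares the LLM encoder's feature space; here we only check
# that the aligner emits features of the expected shape.
print(aligner(vlm_feats).shape)  # torch.Size([2, 8, 48])
```

The key design point this sketch mirrors is that the expensive diffusion decoder never appears in the training loop: text prediction through the frozen LLM decoder stands in for it, which is what keeps the training lightweight.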
Problem

Research questions and friction points this paper is trying to address.

Enhances multimodal reasoning in diffusion models
Simplifies alignment using vision-language training
Improves accuracy in complex multimodal tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns VLMs with LLM decoder
Simplifies training for diffusion models
Enhances multimodal reasoning capabilities
👥 Authors
Zhenxing Mi, Department of Computer Science and Engineering (CSE), The Hong Kong University of Science and Technology (HKUST)
Kuan-Chieh Wang, Snap Inc. (machine learning, computer vision)
Guocheng Qian, Snap Inc.
Hanrong Ye, NVIDIA Research (multi-task multi-modal models)
Runtao Liu, Hong Kong University of Science and Technology (computer vision, AI safety, RLHF, reasoning)
Sergey Tulyakov, Director of Research, Snap Inc. (computer vision, machine learning)
Kfir Aberman, Research Scientist at Snap (computer graphics, generative AI, personalization)
Dan Xu, Department of Computer Science and Engineering (CSE), The Hong Kong University of Science and Technology (HKUST)