ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multimodal reasoning, existing approaches lack a well-defined notion of a "meaningful interleaved reasoning chain" for iterative language-vision coordination. This work posits that textual and visual reasoning should be complementary, not isomorphic, modalities, and introduces a unified model that alternates dynamically and progressively between text and image during inference. Methodologically, the authors fine-tune on 24K high-quality interleaved reasoning trajectories, using joint vision-language modeling to support diverse multimodal generation. The core contributions are threefold: (1) a paradigm of complementary interleaved reasoning chains; (2) cross-task generalization, adaptive switching among reasoning patterns, and emergent, unseen visual-manipulation skills; and (3) an average gain of 34.7% over the base model on vision-centric tasks, matching or exceeding larger proprietary models while exhibiting robust test-time scaling.
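
As a rough illustration of the alternation described above, the sketch below shows what an interleaved text-image inference loop might look like. This is a minimal sketch under assumptions: `TextThought`, `ImageThought`, and `model.step` are hypothetical names, since the paper summary does not publish ThinkMorph's actual interface.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical thought types; the real ThinkMorph API is not specified here.
@dataclass
class TextThought:
    content: str

@dataclass
class ImageThought:
    pixels: bytes  # placeholder for a generated intermediate image

Thought = Union[TextThought, ImageThought]

def interleaved_cot(model, question: str, image: bytes,
                    max_steps: int = 8) -> List[Thought]:
    """Alternate text and image thoughts until the model emits an answer.

    `model.step` is a hypothetical call that, given the running trace,
    returns the next thought; the model itself decides which modality
    advances the reasoning at each step.
    """
    trace: List[Thought] = []
    for _ in range(max_steps):
        thought = model.step(question=question, image=image, trace=trace)
        trace.append(thought)
        # Assume a final text thought of the form "Answer: ..." ends the chain.
        if isinstance(thought, TextThought) and thought.content.startswith("Answer:"):
            break
    return trace
```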

📝 Abstract
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
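
The abstract's claim of better test-time scaling through diversified multimodal thoughts is consistent with a self-consistency style of sampling. Below is a minimal sketch under assumptions: `model.sample_trace` is a hypothetical call, and the trace objects are assumed to look like those in the earlier sketch; this is not the paper's published procedure.

```python
from collections import Counter
from typing import List

def extract_answer(trace: List) -> str:
    # Assumes the final text thought carries "Answer: ..." as in the
    # earlier sketch; returns "" if no answer is found.
    for thought in reversed(trace):
        text = getattr(thought, "content", "")
        if isinstance(text, str) and text.startswith("Answer:"):
            return text[len("Answer:"):].strip()
    return ""

def scale_at_test_time(model, question: str, image: bytes,
                       n_samples: int = 8) -> str:
    # Sample several diversified interleaved traces and majority-vote the
    # final answers (self-consistency). `model.sample_trace` is hypothetical;
    # a nonzero temperature diversifies the multimodal thoughts.
    answers = [
        extract_answer(model.sample_trace(question=question, image=image,
                                          temperature=0.9))
        for _ in range(n_samples)
    ]
    return Counter(answers).most_common(1)[0][0]
```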
Problem

Research questions and friction points this paper is trying to address.

Defining meaningful multimodal interleaved chain-of-thought reasoning
Developing complementary text-image modalities for reasoning advancement
Eliciting emergent multimodal intelligence through unified model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates progressive text-image reasoning steps that concretely manipulate visual content
Fine-tuned on 24K high-quality interleaved reasoning traces (a hypothetical record format is sketched after this list)
Exhibits emergent multimodal intelligence: unseen visual-manipulation skills and adaptive switching between reasoning modes
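
To make the training-data claim concrete, here is one hypothetical record illustrating what a corpus of 24K trajectories alternating text and image steps could look like. The field names, file paths, and task are illustrative assumptions, not the paper's actual schema.

```python
# One hypothetical fine-tuning record; every value here is illustrative.
example_trace = {
    "question": "Which exit does the marked path in the maze lead to?",
    "input_image": "mazes/0001.png",
    "steps": [
        {"type": "text",
         "content": "Trace the corridor from the start marker; it turns right twice."},
        {"type": "image",
         "content": "mazes/0001_step1.png"},  # intermediate image with the path drawn in
        {"type": "text",
         "content": "The drawn path reaches the top-right opening."},
    ],
    "answer": "Answer: the top-right exit",
}
```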