GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) underperform on vision-centric multimodal reasoning tasks, largely because they rely on logic- and knowledge-driven "slow thinking" and fail to dynamically integrate and reinterpret ambiguous visual cues during reasoning. To address this, the authors present GThinker, a reasoning MLLM built around Cue-Rethinking, a flexible reasoning pattern that grounds inferences in visual cues and iteratively reinterprets those cues to resolve inconsistencies. The model is trained with a two-stage pipeline: pattern-guided cold-start initialization followed by incentive reinforcement learning. To support training, the authors construct GThinker-11K, a dataset of 7K high-quality, iteratively annotated reasoning paths and 4K curated reinforcement learning samples, filling the data gap for general multimodal reasoning. GThinker achieves 81.5% on the M³CoT benchmark, surpassing O4-mini; it also yields an average 2.1% improvement on general-scenario multimodal reasoning benchmarks while matching advanced reasoning models on mathematical reasoning.

📝 Abstract
Despite notable advancements in multimodal reasoning, leading Multimodal Large Language Models (MLLMs) still underperform on vision-centric multimodal reasoning tasks in general scenarios. This shortfall stems from their predominant reliance on logic- and knowledge-based slow-thinking strategies, which, while effective for domains like math and science, fail to integrate visual information effectively during reasoning. Consequently, these models often fail to adequately ground visual cues, resulting in suboptimal performance on tasks that require multiple plausible visual interpretations and inferences. To address this, we present GThinker (General Thinker), a novel reasoning MLLM excelling in multimodal reasoning across general scenarios, mathematics, and science. GThinker introduces Cue-Rethinking, a flexible reasoning pattern that grounds inferences in visual cues and iteratively reinterprets these cues to resolve inconsistencies. Building on this pattern, we further propose a two-stage training pipeline, comprising pattern-guided cold start and incentive reinforcement learning, designed to enable multimodal reasoning capabilities across domains. Furthermore, to support the training, we construct GThinker-11K, comprising 7K high-quality, iteratively annotated reasoning paths and 4K curated reinforcement learning samples, filling the data gap toward general multimodal reasoning. Extensive experiments demonstrate that GThinker achieves 81.5% on the challenging comprehensive multimodal reasoning benchmark M³CoT, surpassing the latest O4-mini model. It also shows an average improvement of 2.1% on general-scenario multimodal reasoning benchmarks, while maintaining on-par performance in mathematical reasoning compared with advanced reasoning counterparts. The code, model, and data will be released soon at https://github.com/jefferyZhan/GThinker.
Problem

Research questions and friction points this paper is trying to address.

Improving vision-centric multimodal reasoning in general scenarios
Addressing inadequate visual cue integration in MLLMs
Enhancing performance in tasks with multiple visual interpretations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cue-Rethinking for visual cue integration (see the sketch after this list)
Two-stage training pipeline for reasoning
GThinker-11K dataset for multimodal training
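To make the Cue-Rethinking idea concrete, below is a minimal sketch of how such an iterative, cue-grounded loop could be orchestrated around a generic MLLM. Everything here is an assumption for illustration: `query_mllm`, the prompt wording, and the `MAX_RETHINKS` cap are hypothetical, and per the abstract the paper trains this rethinking behavior into the model (via pattern-guided cold start and reinforcement learning) rather than scripting it externally.

```python
# Minimal sketch of a Cue-Rethinking loop at inference time (illustrative only).
# `query_mllm`, the prompts, and MAX_RETHINKS are hypothetical stand-ins; GThinker
# itself learns this behavior through training rather than external orchestration.

MAX_RETHINKS = 3  # assumed cap on rethinking rounds


def query_mllm(image, prompt: str) -> str:
    """Placeholder for a call to any chat-style multimodal LLM."""
    raise NotImplementedError


def cue_rethinking(image, question: str) -> str:
    # 1. Produce an initial reasoning trace explicitly grounded in visual cues.
    trace = query_mllm(image, f"Question: {question}\n"
                              "List the key visual cues, then reason to an answer.")
    for _ in range(MAX_RETHINKS):
        # 2. Check each inference for consistency with the cues it cites.
        verdict = query_mllm(image, f"Reasoning:\n{trace}\n"
                                    "Is every inference consistent with the visual cues "
                                    "it cites? Reply CONSISTENT, or list the conflicts.")
        if verdict.strip().upper().startswith("CONSISTENT"):
            break
        # 3. Reinterpret the conflicting cues and revise the reasoning.
        trace = query_mllm(image, f"Previous reasoning:\n{trace}\nConflicts:\n{verdict}\n"
                                  "Reinterpret the conflicting cues and redo the reasoning.")
    return trace
```

The key design point this sketch illustrates is that rethinking is triggered by cue-level inconsistencies rather than by a fixed number of self-reflection steps, which is what distinguishes the pattern from generic "think again" prompting.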
👥 Authors
Yufei Zhan
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Large Multimodal Models, Grounding and Detection
Ziheng Wu
ByteDance
Computer Vision
Yousong Zhu
Associate Professor, Institute of Automation, Chinese Academy of Sciences
Multimodal Large Language Models, Self-supervised Learning, Object Detection
Rongkun Xue
Xi’an Jiaotong University
Ruipu Luo
ByteDance
Natural Language Processing
Zhenghao Chen
ByteDance
Can Zhang
ByteDance
Yifan Li
Renmin University of China
Zhentao He
ByteDance
Zheming Yang
ByteDance
Ming Tang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences
Minghui Qiu
Alibaba Group
Deep Learning, Transfer Learning, Chatbots, NLP, Artificial Intelligence
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences; Peng Cheng Laboratory; Wuhan AI Research