Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) lack fine-grained structural modeling and quality assessment of intermediate reasoning steps in chain-of-thought (CoT) inference. Method: We propose the first step-level reasoning framework, comprising (i) construction of fine-grained step-level reasoning data, (ii) a process reward model (PRM) that scores the quality of each reasoning step, and (iii) reinforcement learning to optimize the entire reasoning chain. Contribution/Results: Our approach makes intermediate multimodal reasoning steps both evaluable and optimizable, which was previously unattainable in VLMs. It achieves significant and consistent performance gains across challenging vision-language understanding benchmarks, including ScienceQA, MMMU, and MME. Furthermore, we empirically validate the effectiveness of reasoning-time scaling, demonstrating improved accuracy with increased inference-time computation. This work establishes a novel paradigm for trustworthy multimodal reasoning.
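To make the step-scoring idea concrete, below is a minimal sketch of how a process reward model might score a reasoning chain step by step. The `score_chain` function, the `prm.score` interface, and the min-aggregation are assumptions for illustration; the summary above does not specify the paper's exact PRM design.

```python
from typing import List

def score_chain(prm, image, question: str, steps: List[str]) -> float:
    """Score each reasoning step with the PRM and aggregate to a chain reward."""
    if not steps:
        return 0.0
    step_scores = []
    context: List[str] = []
    for step in steps:
        context.append(step)
        # Hypothetical PRM call: judges the latest step given the image,
        # the question, and all steps so far; returns a score in [0, 1].
        step_scores.append(prm.score(image, question, context))
    # Min-aggregation penalizes a chain for its single worst step, a common
    # choice for process rewards (mean or product are alternatives).
    return min(step_scores)
```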

📝 Abstract
Chain-of-thought reasoning has demonstrated remarkable success in large language models, yet its adaptation to vision-language reasoning remains an open challenge with unclear best practices. Existing attempts typically employ reasoning chains at a coarse-grained level, which struggle to perform fine-grained structured reasoning and, more importantly, make it difficult to evaluate the reward and quality of intermediate reasoning steps. In this work, we delve into chain-of-step reasoning for vision-language models, enabling accurate assessment of reasoning step quality and thereby effective reinforcement learning and inference-time scaling with fine-grained rewards. We present a simple, effective, and fully transparent framework comprising step-level reasoning data, a process reward model (PRM), and reinforcement learning training. With the proposed approaches, our models set strong baselines with consistent improvements on challenging vision-language benchmarks. More importantly, we conduct a thorough empirical analysis and ablation study, unveiling the impact of each component and several intriguing properties of inference-time scaling. We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning. Our dataset, PRM, and code will be available at https://github.com/baaivision/CoS.
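As one concrete reading of PRM-guided inference-time scaling, the sketch below reranks sampled chains by their PRM score (best-of-N). The `best_of_n` function and the `vlm.sample_chain` interface are hypothetical, and `score_chain` is the step-scoring sketch shown earlier; the paper may use a different search or voting scheme.

```python
def best_of_n(vlm, prm, image, question: str, n: int = 8):
    """Sample n reasoning chains and keep the answer of the best-scoring one."""
    best_answer, best_score = None, float("-inf")
    for _ in range(n):
        # Hypothetical VLM interface: returns the list of reasoning steps
        # and the final answer for one sampled chain.
        steps, answer = vlm.sample_chain(image, question)
        score = score_chain(prm, image, question, steps)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```

Increasing n trades inference-time compute for accuracy, which is the scaling behavior the abstract reports analyzing.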
Problem

Research questions and friction points this paper is trying to address.

Adapting chain-of-thought reasoning effectively to vision-language models
Evaluating the quality of fine-grained reasoning steps in vision-language tasks
Developing a transparent reinforcement-learning framework for multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained step-level reasoning data construction
Process reward model (PRM) for step-by-step quality evaluation
Reinforcement learning over the full reasoning chain within a transparent framework (see the sketch after this list)
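As referenced in the last item above, here is one plausible shape for the reinforcement-learning component: GRPO-style group-normalized advantages computed from PRM scores of chains sampled for the same prompt. This is an assumption for illustration; the page does not pin down the paper's exact RL algorithm.

```python
import statistics
from typing import List

def group_advantages(chain_scores: List[float]) -> List[float]:
    """Normalize PRM scores within a group of chains sampled for one prompt.

    Chains scoring above the group mean receive positive advantage (reinforced);
    chains below it receive negative advantage (discouraged).
    """
    mean = statistics.mean(chain_scores)
    std = statistics.pstdev(chain_scores) or 1.0  # guard against zero variance
    return [(s - mean) / std for s in chain_scores]
```

Each chain's tokens (or each step's tokens, for finer credit assignment) would then be weighted by its advantage in the policy-gradient loss.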
👥 Authors
Honghao Chen
PhD student, Department of Chemical Engineering, Tsinghua University
Chemical Engineering, Artificial Intelligence, Catalysis
Xingzhou Lou
ByteDance
Deep Learning, Reinforcement Learning
Xiaokun Feng
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Deep Learning
Kaiqi Huang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Xinlong Wang
Beijing Academy of Artificial Intelligence