🤖 AI Summary
Existing vision-language models (VLMs) lack fine-grained structural modeling and quality assessment of intermediate reasoning steps in chain-of-thought (CoT) inference. Method: We propose a step-level reasoning framework comprising (i) construction of fine-grained step-level reasoning data, (ii) a process reward model (PRM) that scores the quality of each reasoning step, and (iii) reinforcement learning to optimize the entire reasoning chain. Contribution/Results: Our approach makes intermediate multimodal reasoning steps both evaluable and optimizable—capabilities that previous VLMs lacked. It achieves significant and consistent performance gains across challenging vision-language understanding benchmarks, including ScienceQA, MMMU, and MME. Furthermore, we empirically validate the effectiveness of inference-time scaling, demonstrating improved accuracy with increased inference-time computation. This work establishes a novel paradigm for trustworthy multimodal reasoning.
📝 Abstract
Chain-of-thought reasoning has demonstrated remarkable success in large language models, yet its adaptation to vision-language reasoning remains an open challenge with unclear best practices. Existing attempts typically employ reasoning chains at a coarse-grained level, which struggle to perform fine-grained structured reasoning and, more importantly, make it difficult to evaluate the reward and quality of intermediate reasoning steps. In this work, we delve into chain-of-step reasoning for vision-language models, enabling accurate assessment of reasoning step quality and leading to effective reinforcement learning and inference-time scaling with fine-grained rewards. We present a simple, effective, and fully transparent framework comprising step-level reasoning data, a process reward model (PRM), and reinforcement learning training. With the proposed approaches, our models set strong baselines with consistent improvements on challenging vision-language benchmarks. More importantly, we conduct a thorough empirical analysis and ablation study, unveiling the impact of each component and several intriguing properties of inference-time scaling. We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning. Our dataset, PRM, and code will be available at https://github.com/baaivision/CoS.