MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification

📅 2025-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing PRM evaluation benchmarks focus narrowly on error detection, neglecting critical scenarios such as reasoning search and multimodal reasoning, which results in incomplete assessment. To address this, we propose MPBench, the first multi-task, multimodal benchmark specifically designed for process-level reward models (PRMs). It covers three core paradigms (step correctness judgment, answer aggregation, and reasoning process search) and pioneers PRM evaluation in joint vision-language reasoning. We introduce a unified three-paradigm evaluation framework that incorporates process-level annotation, multimodal data construction, and role-driven evaluation protocols to model cross-modal reasoning trajectories. MPBench provides a standardized evaluation suite with baseline results, revealing systematic weaknesses of current PRMs in complex multimodal reasoning, and thus serves as a foundational resource for developing more robust and interpretable reasoning-augmented models.

📝 Abstract
Reasoning is an essential capacity for large language models (LLMs) to address complex tasks, where the identification of process errors is vital for improving this ability. Recently, process-level reward models (PRMs) have been proposed to provide step-wise rewards that facilitate reinforcement learning and data production during training and guide LLMs toward correct steps during inference, thereby improving reasoning accuracy. However, existing benchmarks for PRMs are text-based and focus on error detection, neglecting other scenarios such as reasoning search. To address this gap, we introduce MPBench, a comprehensive, multi-task, multimodal benchmark designed to systematically assess the effectiveness of PRMs in diverse scenarios. MPBench employs three evaluation paradigms, each targeting a specific role of PRMs in the reasoning process: (1) Step Correctness, which assesses the correctness of each intermediate reasoning step; (2) Answer Aggregation, which aggregates multiple solutions and selects the best one; and (3) Reasoning Process Search, which guides the search for optimal reasoning steps during inference. Through these paradigms, MPBench enables comprehensive evaluation and provides insights into the development of multimodal PRMs.
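The three paradigms above can be sketched in code. This is a minimal toy illustration, not the paper's implementation: `score_step` is a placeholder heuristic standing in for a trained PRM, and the min-over-steps aggregation and greedy step selection are common PRM usage patterns assumed here for concreteness.

```python
# Toy stand-in for a process-level reward model (PRM). A real PRM is a
# trained neural scorer; this heuristic only serves to make the three
# evaluation paradigms concrete and runnable.
def score_step(step: str) -> float:
    return 0.1 if "ERROR" in step else 0.9

# Paradigm 1: Step Correctness -- judge each intermediate step
# by thresholding its step-level reward.
def step_correctness(solution: list[str], threshold: float = 0.5) -> list[bool]:
    return [score_step(s) >= threshold for s in solution]

# Paradigm 2: Answer Aggregation -- score each candidate solution by the
# minimum reward over its steps (one common aggregation choice) and
# return the index of the best candidate.
def aggregate_answers(candidates: list[list[str]]) -> int:
    scores = [min(score_step(s) for s in sol) for sol in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

# Paradigm 3: Reasoning Process Search -- greedily extend a partial
# trajectory, picking the highest-reward expansion at each depth.
def greedy_search(partial: list[str], expansions: dict[int, list[str]],
                  depth: int) -> list[str]:
    trajectory = list(partial)
    for d in range(depth):
        options = expansions.get(d, [])
        if not options:
            break
        trajectory.append(max(options, key=score_step))
    return trajectory
```

For example, given two candidate solutions where one contains a flagged error step, `aggregate_answers` selects the error-free one; the same scorer drives both per-step judgments and inference-time search.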
Problem

Research questions and friction points this paper is trying to address.

Assessing PRMs in diverse reasoning scenarios
Evaluating correctness of intermediate reasoning steps
Guiding search for optimal reasoning processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal benchmark for process error identification
Step-wise rewards to enhance reasoning accuracy
Three evaluation paradigms for comprehensive assessment