DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning

📅 2025-05-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Process reward models (PRMs) for multimodal large language models (MLLMs) generalize poorly because heterogeneous multimodal reasoning datasets vary widely in quality and the task distribution shifts between training and testing. Method: We propose DreamPRM, a domain-reweighted training framework built on bi-level optimization. The lower level fine-tunes the PRM on multiple datasets under learnable per-dataset domain weights; the upper level evaluates the PRM on a held-out meta-learning set and updates the domain weights through an aggregation loss, steering training toward high-quality reasoning signals and mitigating quality imbalance. Contribution/Results: On mathematical and general multimodal reasoning benchmarks, test-time scaling with DreamPRM consistently improves state-of-the-art MLLMs, with larger accuracy gains than existing data selection and inference-scaling approaches, offering a practical recipe for reliable PRM training in multimodal settings.

๐Ÿ“ Abstract
Reasoning has substantially improved the performance of large language models (LLMs) on complicated tasks. Central to the current reasoning studies, Process Reward Models (PRMs) offer a fine-grained evaluation of intermediate reasoning steps and guide the reasoning process. However, extending PRMs to multimodal large language models (MLLMs) introduces challenges. Since multimodal reasoning covers a wider range of tasks compared to text-only scenarios, the resulting distribution shift from the training to testing sets is more severe, leading to greater generalization difficulty. Training a reliable multimodal PRM, therefore, demands large and diverse datasets to ensure sufficient coverage. However, current multimodal reasoning datasets suffer from a marked quality imbalance, which degrades PRM performance and highlights the need for an effective data selection strategy. To address these issues, we introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs which employs bi-level optimization. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights, allowing the PRM to prioritize high-quality reasoning signals and alleviating the impact of dataset quality imbalance. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset; this feedback updates the domain weights through an aggregation loss function, thereby improving the generalization capability of the trained PRM. Extensive experiments on multiple multimodal reasoning benchmarks covering both mathematical and general reasoning show that test-time scaling with DreamPRM consistently improves the performance of state-of-the-art MLLMs. Further comparisons reveal that DreamPRM's domain-reweighting strategy surpasses other data selection methods and yields higher accuracy gains than existing test-time scaling approaches.
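The lower/upper-level interplay described above can be sketched with a toy example. The following is a minimal numpy illustration of the bi-level idea only, not the paper's implementation: a scalar linear model stands in for the PRM, two synthetic "domains" (one clean, one corrupted) stand in for heterogeneous reasoning datasets, and a finite-difference hypergradient stands in for the paper's aggregation-loss update of the domain weights. All names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two training "domains": domain 0 carries the true signal (y = 2x),
# domain 1 carries a corrupted signal (y = -x), mimicking quality imbalance.
x0 = rng.normal(size=100); y0 = 2.0 * x0
x1 = rng.normal(size=100); y1 = -1.0 * x1
# Held-out meta set drawn from the true task (upper-level evaluation data).
mx = rng.normal(size=50); my = 2.0 * mx

alpha = np.array([0.5, 0.5])  # domain weights (upper-level variables)
w = 0.0                       # model parameter (lower-level variable)

def inner_step(w, alpha, lr=0.1):
    """Lower level: one gradient step on the domain-weighted squared error."""
    g = sum(a * np.mean(2 * (w * x - y) * x)
            for a, (x, y) in zip(alpha, [(x0, y0), (x1, y1)]))
    return w - lr * g

def meta_loss(w):
    """Upper level: loss of the updated model on the meta set."""
    return np.mean((w * mx - my) ** 2)

for _ in range(200):
    w_new = inner_step(w, alpha)
    # Approximate d(meta_loss)/d(alpha) by perturbing each domain weight,
    # re-running the inner step, and taking a forward difference.
    eps, grad_a = 1e-3, np.zeros_like(alpha)
    for i in range(len(alpha)):
        a_pert = alpha.copy(); a_pert[i] += eps
        grad_a[i] = (meta_loss(inner_step(w, a_pert)) - meta_loss(w_new)) / eps
    alpha = np.clip(alpha - 1.0 * grad_a, 0.0, None)
    alpha = alpha / alpha.sum()  # keep weights non-negative and normalized
    w = w_new
```

Under this setup the meta feedback drives the weight of the clean domain toward 1 and the corrupted domain toward 0, so the model recovers the true slope; DreamPRM applies the same principle with a full PRM, many real datasets, and an aggregation loss in place of the finite-difference shortcut.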
Problem

Research questions and friction points this paper is trying to address.

Addressing generalization challenges in multimodal Process Reward Models (PRMs)
Mitigating dataset quality imbalance in multimodal reasoning tasks
Improving PRM performance via domain-reweighted bi-level optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-reweighted training framework for multimodal PRMs
Bi-level optimization for improved generalization
Prioritizes high-quality reasoning signals in datasets