AI Summary
This work addresses the high cost and redundancy of Monte Carlo-based annotations in training Multimodal Process Reward Models (MPRMs). It formally characterizes, for the first time, the information-gradient update mechanism underlying MPRM learning, revealing that effective training hinges on the mixture balance and reliability of positive and negative step labels. Building on this insight, the authors propose Balanced-Information Score (BIS), a data selection strategy that requires no additional annotations. BIS integrates retrospective step-level scoring, explicit modeling of label mixture balance and reliability, and efficient subset selection. Evaluated on VisualProcessBench, BIS achieves full-data performance using only 10% of the training samples and yields a 4.1% relative improvement over random sampling, substantially enhancing data efficiency.
Abstract
Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies data efficiency in MPRM training. Our preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora. To explain this, we formalize a theoretical framework and reveal that informative gradient updates depend on two factors: the label mixture of positive/negative steps and label reliability (the average MC score of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability based on existing MC signals at the rollout level, without incurring any additional annotation cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match or even surpass full-data performance at small data fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%.
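The two factors the abstract identifies can be sketched concretely. The following is a minimal illustrative Python sketch, not the paper's exact formulation: the thresholding rule, the entropy-based balance term, and the weight `alpha` are all assumptions introduced here for illustration.

```python
import math

def bis_score(step_mc_scores, threshold=0.5, alpha=0.5):
    """Score one rollout from its existing per-step MC scores (no extra annotation).

    Combines (hypothetically) the two factors named in the abstract:
    - mixture balance of positive/negative step labels, and
    - reliability, i.e. the average MC score of the positive steps.
    """
    labels = [1 if s > threshold else 0 for s in step_mc_scores]
    p = sum(labels) / len(labels)  # fraction of positive steps
    # Mixture balance: binary entropy, maximal when positives and negatives are balanced.
    balance = 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    # Reliability: average MC score over the positive steps.
    pos = [s for s, l in zip(step_mc_scores, labels) if l == 1]
    reliability = sum(pos) / len(pos) if pos else 0.0
    return alpha * balance + (1 - alpha) * reliability

def select_subset(rollouts, fraction=0.10):
    """Rank rollouts by BIS-style score and keep the top fraction."""
    ranked = sorted(rollouts, key=bis_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * fraction))]
```

Under this sketch, a rollout with an even positive/negative split and confident positive steps (e.g. `[0.9, 0.1, 0.8, 0.2]`) scores higher than an all-positive or all-negative rollout, matching the intuition that such mixed, reliable rollouts yield informative gradient updates.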