SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

📅 2025-06-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Efficient, high-quality automatic process annotation remains a critical bottleneck in enhancing the multi-step reasoning capabilities of large language models (LLMs). To address this, the authors propose SPARE, a framework for reference-solution-guided, single-pass, step-level annotation: each solution step is aligned to one or more steps of a reference solution and evaluated with explicit reasoning. The resulting annotations support both offline reinforcement-learning fine-tuning and reward-model training for ranking and aggregating multiple LLM-generated outputs, substantially reducing annotation overhead. Compared to tree-search-based automatic annotation, SPARE is 2.6× more efficient, requiring only 38% of the runtime, and it improves reasoning performance on four datasets spanning three domains: mathematical reasoning, multi-hop question answering, and spatial reasoning. The trained SPARE-PRM model is open-sourced to support reproducible research.

📝 Abstract
Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.
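The abstract's core idea, annotating every solution step in a single pass by aligning it to one or more reference-solution steps and recording an explicit verdict, can be illustrated with a small sketch. Everything below is hypothetical scaffolding: in the paper the judge is an LLM prompted with the full solution and reference, whereas here a toy string-matching judge stands in for it, and the `StepAnnotation` fields are illustrative, not the paper's schema.

```python
# Hedged sketch of SPARE-style single-pass, reference-guided step annotation.
# The judge is a hypothetical stand-in; in the paper an LLM plays this role.
from dataclasses import dataclass, field


@dataclass
class StepAnnotation:
    step: str
    aligned_ref_steps: list = field(default_factory=list)  # indices into reference
    correct: bool = False
    rationale: str = ""


def annotate_solution(solution_steps, reference_steps, judge):
    """Single pass: the judge sees the whole solution and the reference once,
    and returns one annotation per solution step (alignment + verdict)."""
    return judge(solution_steps, reference_steps)


def toy_judge(solution_steps, reference_steps):
    """Toy judge: a step is 'correct' iff its final token matches the final
    token of some reference step (a crude proxy for step alignment)."""
    annotations = []
    for step in solution_steps:
        matches = [i for i, ref in enumerate(reference_steps)
                   if step.split()[-1] == ref.split()[-1]]
        annotations.append(StepAnnotation(
            step=step,
            aligned_ref_steps=matches,
            correct=bool(matches),
            rationale=(f"matched reference step(s) {matches}" if matches
                       else "no aligned reference step")))
    return annotations


anns = annotate_solution(
    ["2 + 3 = 5", "5 * 4 = 21"],                          # candidate solution
    ["First compute 2 + 3 = 5", "Then 5 * 4 = 20"],       # reference solution
    toy_judge)
# First step aligns to the reference; the second (arithmetic slip) does not.
```

The point of the single pass is that all per-step labels come from one evaluation call, rather than one tree-search rollout per step as in Monte-Carlo-style annotation.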
Problem

Research questions and friction points this paper is trying to address.

Efficient, high-quality automated process annotation for LLM reasoning
Aligning each solution step to reference-solution steps for evaluation
Improving both reasoning performance and annotation efficiency in multi-step tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-pass, step-level annotation with reference-guided evaluation (SPARE)
Aligns each solution step to one or more reference steps, with explicit evaluation reasoning
Supports both offline RL fine-tuning and reward-model training at a fraction of tree-search annotation cost
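The second use case above, training a process reward model (PRM) to rank or aggregate multiple LLM-generated outputs, can be sketched as follows. This is a generic illustration, not the paper's exact procedure: the per-step scores would come from the trained SPARE-PRM, and `min`/`prod` are common PRM aggregation choices assumed here for illustration.

```python
# Hedged sketch: ranking candidate solutions by aggregated per-step PRM scores.
# Step scores are assumed inputs; min/prod aggregation is a common convention,
# not necessarily the paper's exact choice.
import math


def aggregate_step_scores(step_scores, mode="min"):
    """Collapse per-step correctness probabilities into one solution score.
    'min' scores a solution by its weakest step; 'prod' multiplies them."""
    if mode == "min":
        return min(step_scores)
    if mode == "prod":
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation mode: {mode}")


def rank_candidates(candidates, mode="min"):
    """candidates: list of (answer, per-step scores). Returns them sorted
    best-first by aggregated score (best-of-N selection)."""
    return sorted(candidates,
                  key=lambda c: aggregate_step_scores(c[1], mode),
                  reverse=True)


candidates = [
    ("42", [0.90, 0.95, 0.40]),  # one weak step drags the solution down
    ("41", [0.80, 0.85, 0.80]),  # uniformly solid steps
]
best_answer = rank_candidates(candidates)[0][0]  # "41" under min-aggregation
```

Under `min` aggregation the uniformly solid solution wins even though the other has higher individual step scores, which is exactly why step-level (rather than outcome-level) supervision matters for multi-step reasoning.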