PARM: Pipeline-Adapted Reward Model

📅 2026-04-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing reward models often disagree with actual execution outcomes in multi-stage large language model (LLM) pipelines, which weakens their guidance for downstream stages. This work proposes a reward modeling approach tailored to multi-stage LLM pipelines, studied through code generation for combinatorial optimization problems. It builds a two-stage pipeline (problem formulation, then code generation) and combines pipeline-specific data collection with Direct Preference Optimization (DPO) so that reward signals agree with final execution results. On four public optimization benchmarks, the method consistently improves execution rate and solving accuracy over baselines and sampling methods. A supplementary cross-domain experiment on GSM8K assesses how well the approach transfers.

📝 Abstract
Reward models (RMs) are central to aligning large language models (LLMs) with human preferences, powering RLHF and advanced decoding strategies. While most prior work focuses on single-step generation, real-world applications increasingly adopt multi-stage LLM pipelines, where effective reward guidance remains underexplored. We investigate this through code generation for combinatorial optimization, constructing a pipeline that integrates reward models into both formulation and solution stages. We identify a critical challenge: inconsistency between reward model predictions and actual pipeline execution outcomes. To address this, we propose the Pipeline-Adapted Reward Model (PARM), which leverages pipeline-specific data and direct preference optimization to align rewards with downstream feedback. We instantiate PARM as a two-stage pipeline (formulation -> code generation) and evaluate it on four public optimization benchmarks, measuring execution rate and solving accuracy against baselines and sampling methods. A supplementary cross-domain experiment on GSM8K assesses transferability. Results demonstrate that PARM consistently improves pipeline output quality and stability, providing new insights into reward modeling for multi-stage LLM reasoning.
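As a rough illustration of the idea in the abstract (collecting pipeline-specific data and applying direct preference optimization), the sketch below turns downstream execution outcomes into preference pairs over stage-1 formulations and evaluates the standard DPO loss for one pair. This is not the paper's implementation: the `Candidate` class, the success-rate ranking rule, the `margin` parameter, and the function names are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Candidate:
    """Hypothetical record: one stage-1 output plus downstream outcomes."""
    text: str
    outcomes: list[bool]  # True = generated code ran and solved the instance

    @property
    def success_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0


def build_preference_pairs(candidates, margin=0.0):
    """Return (chosen, rejected) text pairs ordered by downstream success rate.

    Assumption: a formulation is preferred when the code generated from it
    succeeds more often; the abstract does not specify the pairing rule.
    """
    pairs = []
    for a, b in combinations(candidates, 2):
        gap = a.success_rate - b.success_rate
        if abs(gap) <= margin:
            continue  # outcomes too close to express a preference
        chosen, rejected = (a, b) if gap > 0 else (b, a)
        pairs.append((chosen.text, rejected.text))
    return pairs


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    x = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(x)) written as a numerically stable softplus(-x)
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

With two formulations whose generated code succeeds in 2 of 3 and 0 of 3 runs, `build_preference_pairs` yields a single pair that prefers the first. `dpo_loss` equals log 2 when the trained model and the reference model assign identical log-probabilities, and it falls as the trained model favors the chosen output more strongly than the reference does.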
Problem

Research questions and friction points this paper is trying to address.

reward model
multi-stage pipeline
LLM alignment
pipeline inconsistency
combinatorial optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipeline-Adapted Reward Model
multi-stage LLM pipeline
direct preference optimization
reward modeling
combinatorial optimization