Prompt-Level Reward Specifications for Open-Ended Post-Training

📅 2026-05-27
🤖 AI Summary
This work addresses a limitation of existing open-domain post-training methods: they rely on scalar rewards assigned after generation and struggle to model prompt-specific local requirements, holistic preferences, and hard constraints explicitly. The authors propose a prompt-level reward specification framework that decouples reward definition from reward computation: task-adaptive scoring rubrics and executable constraint checkers are constructed offline from the prompts before training, then combined with global quality scores to produce a hybrid reward signal. The framework enables explicit reward specification without human preference labels, reference answers, or a separate reward model, and it supports both offline and online reinforcement learning. Experiments show improved response ranking across multiple open-domain benchmarks and effective online policy optimization, and ablation studies confirm that the individual components play complementary roles.
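To make the idea of an "executable constraint checker" concrete, here is a minimal sketch. The example prompt, the function names, and the two checks are all assumptions made for illustration; the summary does not describe the authors' actual checker format.

```python
import re

# Hypothetical prompt (an assumption, not taken from the paper):
# "Write a product description in at most 50 words that mentions 'battery'."


def check_word_limit(response: str, max_words: int = 50) -> bool:
    """Deterministic check: the response has at most max_words words."""
    return len(response.split()) <= max_words


def check_keyword(response: str, keyword: str = "battery") -> bool:
    """Deterministic check: keyword appears as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, response, re.IGNORECASE) is not None


# In this sketch the checkers are built once per prompt, before training,
# and then reused unchanged for every sampled response (rollout).
CHECKERS = [check_word_limit, check_keyword]


def constraint_score(response: str) -> float:
    """Fraction of hard constraints the response satisfies, in [0, 1]."""
    results = [check(response) for check in CHECKERS]
    return sum(results) / len(results)
```

Because the checks are plain code, the same response always receives the same constraint score, which is the sense in which such constraints are deterministic.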
📝 Abstract
Open-ended post-training benefits from rewards that make prompt-specific success conditions explicit, rather than relying only on post-hoc scalar scores. In instruction following, writing, and decision-support tasks, response quality depends on local requirements, holistic preferences, and explicit constraints, but existing reward methods often leave these criteria implicit or cover only narrowly verifiable cases. We propose a prompt-level reward specification framework that separates reward specification from reward computation. Given only prompts, our framework constructs reusable task-adaptive rubrics and executable hard-constraint checkers offline, making reward criteria explicit before training and reusable across rollouts. At scoring time, artifact-anchored rubric and code scores are combined with an independent global score for residual holistic quality, yielding a normalized hybrid reward over requirement satisfaction, holistic quality, and deterministic constraints. The framework requires no human preference annotations, reference answers, or a separately trained reward model. Experiments show that the resulting reward improves offline RM-style response ranking and supports online reinforcement learning across multiple open-ended benchmarks. Ablations further show that rubrics, global scoring, and executable verification provide complementary supervision.
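The abstract does not state how the three signals are combined, so the following is only a sketch of one plausible combination: a weighted average of a rubric score, a global score, and a constraint score, each assumed to lie in [0, 1]. The `RewardParts` type, the `hybrid_reward` function, and the default weights are hypothetical and are not the authors' formula.

```python
from dataclasses import dataclass


@dataclass
class RewardParts:
    rubric: float         # requirement satisfaction judged against the rubric, in [0, 1]
    global_score: float   # independent holistic quality score, in [0, 1]
    constraint: float     # fraction of executable hard constraints passed, in [0, 1]


def hybrid_reward(
    parts: RewardParts,
    w_rubric: float = 0.4,      # assumed weight, not from the paper
    w_global: float = 0.3,      # assumed weight, not from the paper
    w_constraint: float = 0.3,  # assumed weight, not from the paper
) -> float:
    """Weighted average of the three component scores, normalized to [0, 1]."""
    for name, value in vars(parts).items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total_weight = w_rubric + w_global + w_constraint
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = (
        w_rubric * parts.rubric
        + w_global * parts.global_score
        + w_constraint * parts.constraint
    )
    return weighted_sum / total_weight


# Example: a response that half-satisfies the rubric, reads well overall,
# but fails every hard constraint.
reward = hybrid_reward(RewardParts(rubric=0.5, global_score=1.0, constraint=0.0))
```

Dividing by the total weight keeps the reward in [0, 1] even if the weights are changed, which makes rewards comparable across prompts with different numbers of rubric items or checkers.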
Problem

Research questions and friction points this paper is trying to address.

reward specification
open-ended post-training
prompt-level criteria
response quality
instruction following
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt-level reward specification
executable constraints
task-adaptive rubrics
hybrid reward
open-ended post-training
Zijun Weng
Fudan University; Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd.
Xiaohui Hu
Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd.
Shuangyong Song
Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd.
Yongxiang Li
Professor, RMIT University
Electronic Materials and Devices
Kaidong Yu
Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd.
Xuanjing Huang
Fudan University