Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards

📅 2025-05-30
🤖 AI Summary
Existing reinforcement learning (RL) methods for subjective language tasks such as creative writing rely on scalar human-preference rewards, which generalize poorly and are susceptible to reward hacking (e.g., over-explanation and length bias). Method: an annotation-free, reference-free RL paradigm comprising (1) a generative pairwise reward model (GenRM) grounded in writing principles, enabling fine-grained, interpretable relative quality assessment, and (2) Bootstrapped Relative Policy Optimization (BRPO), which integrates a transient in-group reference and principle-driven self-critique to circumvent the inherent limitations of scalar rewards. Contribution/Results: Integrated into the RLVR framework, the resulting Writing-Zero model develops robust writing capability without supervised fine-tuning, outperforming scalar-reward baselines across diverse writing benchmarks while effectively mitigating reward hacking and enabling reference-free, deception-resistant evolution of autonomous writing ability.

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has enabled large language models (LLMs) to achieve remarkable breakthroughs in reasoning tasks with objective ground-truth answers, such as mathematics and code generation. However, a significant gap remains for non-verifiable tasks, like creative writing and open-ended dialogue, where quality assessment is inherently subjective and lacks definitive references. Existing approaches for these domains often rely on scalar reward models trained with human preferences, which suffer from limited generalization and are prone to reward hacking, such as over-explanation and length bias. In this work, we propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. The pairwise writing GenRM leverages self-principled critique to transform subjective assessments into reliable, verifiable rewards, while BRPO enables dynamic, reference-free pairwise comparison by leveraging a bootstrapped response as temporary reference from within group rollouts during RL training. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning, as demonstrated by Writing-Zero, which shows consistent improvement and strong resistance to reward hacking compared to scalar reward baselines. Furthermore, our method achieves competitive results on both in-house and open-source writing benchmarks. Our findings suggest the potential to unify rule-based, reference-based, and reference-free reward modeling under the RLVR framework, thus paving the way for a comprehensive and scalable RL training paradigm applicable across all language tasks.
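The abstract describes BRPO as bootstrapping one response from each group of rollouts as a temporary reference, scoring the others against it pairwise, and optimizing relative to the group. A minimal sketch of that reward-assignment step is below; the function name `brpo_advantages` and the `pairwise_judge` interface are hypothetical stand-ins (the paper's judge is the writing-principle GenRM), not the authors' implementation.

```python
import random
from typing import Callable, List

def brpo_advantages(
    rollouts: List[str],
    pairwise_judge: Callable[[str, str], float],
) -> List[float]:
    """Sketch of BRPO-style reward assignment (hypothetical interface).

    One rollout from the group is bootstrapped as a transient reference;
    every other rollout receives a pairwise win/loss reward against it,
    and rewards are normalized into group-relative advantages.
    """
    # Pick one response from the group as the temporary reference.
    ref_idx = random.randrange(len(rollouts))
    reference = rollouts[ref_idx]

    # Pairwise reward from the judge (e.g., +1 win, -1 loss vs. reference);
    # the reference itself gets a neutral reward.
    rewards = [
        0.0 if i == ref_idx else pairwise_judge(resp, reference)
        for i, resp in enumerate(rollouts)
    ]

    # Group-relative advantage: center and scale within the group,
    # in the spirit of group-relative policy optimization.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
```

Because the reference is drawn from the group itself, no external gold answer or fixed reference text is needed, which is how the method stays reference-free while still producing a verifiable pairwise signal.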
Problem

Research questions and friction points this paper is trying to address.

Non-verifiable tasks such as creative writing lack the objective ground truth that RLVR requires
Scalar reward models trained on human preferences generalize poorly and are prone to reward hacking
Subjective quality assessments must be converted into reliable, verifiable rewards for RL training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Writing-principle-based pairwise Generative Reward Model (GenRM)
Bootstrapped Relative Policy Optimization (BRPO) algorithm with a transient in-group reference
Unified RLVR framework for non-verifiable tasks