Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

📅 2025-06-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluating open-ended long-form generation is hard because semantic quality resists quantification, and existing metrics such as ROUGE and BERTScore often miss coherence, stylistic consistency, and fine-grained relevance. To address this, the paper proposes PrefBERT, a BERT-based scoring model trained on multi-sentence and paragraph-length responses with Likert-scale quality ratings. PrefBERT serves as the reward signal in GRPO reinforcement learning, giving distinct, semantics-aware rewards to good and bad outputs. The evaluation combines LLM-as-a-judge scoring, human ratings, and qualitative analysis. Policy models trained with PrefBERT rewards produce responses that align better with human preferences than those trained with ROUGE-L or BERTScore, and PrefBERT remains reliable across varied long passages.

📝 Abstract

Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.
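To make the role of the reward signal concrete, here is a minimal sketch of the group-relative advantage step commonly used in GRPO: several responses are sampled for one prompt, each gets a scalar reward, and the rewards are normalized within the group. The scores below are hypothetical placeholders standing in for a scoring model's outputs; they are not PrefBERT results, and the function name, the use of the population standard deviation, and the `eps` term are illustrative assumptions rather than details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    scored above the group average get a positive training signal and those
    below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four responses sampled from the same prompt.
scores = [0.82, 0.35, 0.60, 0.15]
advantages = group_relative_advantages(scores)

for score, adv in zip(scores, advantages):
    print(f"reward={score:.2f} advantage={adv:+.3f}")
```

Because the advantages depend only on how rewards rank and spread within a group, a reward that barely separates good from bad responses yields a weak training signal, which is why the quality of the scoring model matters here.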

Problem

Research questions and friction points this paper is trying to address.

Evaluating open-ended long-form generation quality is challenging

Existing methods often miss coherence, style, or relevance, or are biased by pretraining data

GRPO training needs better semantic reward feedback than ROUGE-L and BERTScore provide

Innovation

Methods, ideas, or system contributions that make the work stand out.

PrefBERT scores open-ended long-form generation effectively

Semantic rewards replace traditional metrics like ROUGE-L

Human-aligned feedback improves policy model training
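As an illustration of why a lexical-overlap metric can be a weak reward for free-form text, the sketch below computes a simplified ROUGE-L F1 from the longest common subsequence of tokens. It assumes lowercase whitespace tokenization and equal weighting of precision and recall; the function names and example sentences are hypothetical, and this is not the scoring code used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 over whitespace tokens (an assumption)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"

print(rouge_l_f1(reference, reference))   # identical text
print(rouge_l_f1(reference, paraphrase))  # same meaning, little word overlap
```

The paraphrase shares almost no tokens with the reference, so it scores far below the identical copy even though the meaning is preserved. A learned scorer is meant to avoid this kind of penalty.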
