AI Summary
Evaluating open-domain long-form generation is difficult because semantic quality is hard to quantify, and existing metrics (e.g., ROUGE, BERTScore) neglect coherence, stylistic consistency, and fine-grained relevance. To address this, we propose PrefBERT, the first semantic reward model trained on multi-sentence and paragraph-level human preference annotations rated on a Likert scale. PrefBERT is also the first to explicitly integrate human preference signals into the GRPO reinforcement learning framework, delivering fine-grained, semantics-aware positive and negative rewards. Our approach combines BERT-based preference modeling with multidimensional human evaluation and LLM-as-a-judge validation. Experiments demonstrate that PrefBERT significantly improves alignment between generated text and human preferences, consistently outperforming baselines in both automated and human evaluations. Moreover, it generalizes robustly across diverse text lengths and writing styles.
Abstract
Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.
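To make the role of a scalar scoring model in GRPO concrete, the sketch below shows the standard group-relative step: several responses to the same prompt are scored, and each score is normalized against the group's mean and standard deviation to obtain an advantage. This is a minimal illustration under stated assumptions, not the authors' implementation. The function `toy_score` is a hypothetical token-overlap stand-in and is not PrefBERT; the names `group_relative_advantages` and `advantages_for_group`, the `eps` constant, and the use of the population standard deviation are likewise choices made for this example.

```python
from statistics import mean, pstdev
from typing import Callable, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is an assumption here; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_score(reference: str, response: str) -> float:
    """Hypothetical stand-in for a learned scorer (NOT PrefBERT).

    Returns the fraction of reference tokens that also appear in the response.
    """
    ref = set(reference.lower().split())
    resp = set(response.lower().split())
    return len(ref & resp) / len(ref) if ref else 0.0


def advantages_for_group(
    reference: str,
    responses: List[str],
    score_fn: Callable[[str, str], float] = toy_score,
) -> List[float]:
    """Score every sampled response, then convert scores to group-relative advantages."""
    rewards = [score_fn(reference, r) for r in responses]
    return group_relative_advantages(rewards)
```

Under this scheme, responses scored above the group average receive positive advantages and those below it receive negative ones, which is why a scorer that separates good from bad long-form outputs matters more than its absolute scale. If every response in a group receives the same score, all advantages are zero and that group contributes no learning signal.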