AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation

📅 2025-04-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of evaluating and improving the writing quality of AI-generated text. Methodologically, it (1) introduces WQ—the first unified benchmark for writing quality assessment, comprising 4,729 expert-annotated instances and formally defining the evaluation task; (2) proposes an edit-aware reward modeling paradigm to train a specialized Writing Quality Reward Model (WQRM); and (3) incorporates a test-time mechanism combining multi-candidate generation with re-ranking to enhance quality without fine-tuning. Empirically, WQRM achieves 74% accuracy on WQ and demonstrates strong cross-distribution generalization. Human evaluation shows that its top-ranked outputs receive a 66% overall preference from professional writers—rising to 72.2% in high-reward-difference scenarios—significantly outperforming baseline methods.

📝 Abstract
AI-generated text is proliferating across domains, from creative writing and journalism to marketing content and scientific articles. Models can follow user-provided instructions to generate coherent and grammatically correct outputs, but in this work we study a more fundamental question: how do we evaluate and improve the writing quality of AI-generated text? Writing quality assessment has received less attention from the community, in part because it is fundamentally subjective and requires expertise. We first introduce the Writing Quality Benchmark (WQ) by consolidating five writing-preference datasets into 4,729 writing quality judgments. Our experiments show that competitive baselines, including state-of-the-art LLMs that excel at reasoning tasks, barely outperform random baselines on WQ. We then train specialized Writing Quality Reward Models (WQRM) of various sizes for writing quality assessment that demonstrate strong generalization on four out-of-distribution test sets and 74% accuracy on the WQ benchmark. To further show WQRM's practical benefits during inference, we leverage additional test-time compute to generate and rank multiple candidate revisions, allowing us to select higher-quality outputs from an initial draft. Human evaluation with 9 experienced writers confirms that WQRM-based selection produces writing samples preferred by experts 66% of the time overall, and 72.2% of the time when the reward gap is larger than 1 point. We release our datasets and models to encourage community engagement with writing quality assessment and development of AI writing systems better aligned with human preferences.
Problem

Research questions and friction points this paper is trying to address.

Evaluate and improve AI-generated text writing quality
Develop specialized models for writing quality assessment
Align AI writing outputs with human preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Consolidated Writing Quality Benchmark dataset
Specialized Writing Quality Reward Models
Test-time candidate revision ranking
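The test-time mechanism can be sketched as a best-of-N reranking loop: generate several candidate revisions of a draft, score each with the reward model, and keep the top-scoring revision only when its reward clearly exceeds the draft's. A minimal sketch, where `generate_candidates` and `reward` are hypothetical stand-ins for the paper's revision generator and WQRM:

```python
def generate_candidates(draft, n=8):
    # Stand-in for an LLM sampling n candidate revisions of the draft
    # (e.g. at varied temperatures); here we just tag the draft.
    return [f"{draft} [candidate revision {i}]" for i in range(n)]

def reward(text):
    # Stand-in for a WQRM-style scalar writing-quality score;
    # a real reward model would score the text with a trained network.
    return float(len(text))  # placeholder heuristic only

def best_of_n(draft, n=8, margin=1.0):
    """Return the highest-reward revision if it beats the draft by
    more than `margin`, otherwise keep the original draft."""
    candidates = generate_candidates(draft, n)
    scored = sorted(((reward(c), c) for c in candidates), reverse=True)
    best_score, best = scored[0]
    # Mirrors the high-reward-difference regime reported in the paper:
    # only accept a revision when the reward gap is meaningful.
    return best if best_score - reward(draft) > margin else draft
```

The `margin` threshold reflects the paper's observation that expert preference for the selected output rises when the reward difference exceeds 1 point; below that gap, keeping the original draft is a reasonable default.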