Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limitations of small language models (SLMs) in Chinese creative writing and the prohibitive deployment costs of large language models (LLMs), this paper proposes a principle-guided LLM-as-a-Judge framework for high-quality blessing generation with minimal reliance on human annotation. Methodologically: (1) a multi-agent rejection sampling mechanism generates preference data; (2) a reward model is trained, and a principle-aligned LLM serves as an interpretable judge to provide direct reinforcement learning feedback via Reinforcement Learning from AI Feedback (RLAIF); (3) adversarial training is integrated with a reflection mechanism to refine the policy. Experiments demonstrate substantial improvements over baselines in generation quality, training efficiency, and scalability. Automatic evaluation metrics correlate strongly with human judgments (Spearman’s ρ > 0.92), validating their reliability. This work establishes a novel paradigm for resource-constrained creative text generation.
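The preference-data step (1) can be sketched as follows. This is a minimal illustration of rejection sampling with a panel of judge agents, not the paper's implementation: the function names, the averaging aggregation, and the toy generator/judges are all assumptions for demonstration.

```python
from typing import Callable, List, Tuple

def multi_agent_rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    judges: List[Callable[[str, str], float]],
    n_candidates: int = 8,
) -> Tuple[str, str]:
    """Draw several candidate completions, score each with a panel of
    judge agents, and keep the best/worst pair as one preference example."""
    candidates = [generate(prompt) for _ in range(n_candidates)]
    # Aggregate the judges' scores by simple averaging (one possible choice).
    scored = [(sum(j(prompt, c) for j in judges) / len(judges), c)
              for c in candidates]
    scored.sort(key=lambda sc: sc[0])  # ascending by aggregate score
    return scored[-1][1], scored[0][1]  # (chosen, rejected)

# Toy, deterministic stand-ins for the generator and the judge agents.
_pool = iter(["joyful", "bland", "radiant", "plain"] * 2)
toy_generate = lambda p: p + " " + next(_pool)
toy_judges = [
    lambda p, c: float("joyful" in c or "radiant" in c),  # "novelty" agent
    lambda p, c: float(len(c) > len(p)),                  # "fluency" agent
]
chosen, rejected = multi_agent_rejection_sample(
    "May your year be", toy_generate, toy_judges)
```

The (chosen, rejected) pairs produced this way would then form the training set for the reward model in step (2).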

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a reward model (RM) trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.
Problem

Research questions and friction points this paper is trying to address.

Enhancing creative writing in small language models efficiently
Reducing computational costs while maintaining creative output quality
Developing scalable AI feedback methods for creative tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent rejection sampling for reward modeling
Principle-guided LLM-as-a-Judge with adversarial training
Reinforcement Learning from AI Feedback framework
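The principle-guided LLM-as-a-Judge reward listed above can be sketched as a rubric prompt plus score parsing. Everything here is an illustrative assumption, not the paper's protocol: the principle wording, the 1-10 scale, the `Overall: <score>/10` output format, and the stubbed `judge_llm` callable are all hypothetical.

```python
import re
from typing import Callable, List

def principle_guided_reward(
    prompt: str,
    response: str,
    principles: List[str],
    judge_llm: Callable[[str], str],
) -> float:
    """Build a rubric-style judge prompt from explicit principles and
    parse a numeric verdict from the judge's reply, normalized to [0, 1]."""
    rubric = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(principles))
    judge_prompt = (
        "Score the response against each principle, then output "
        "'Overall: <score>/10'.\n"
        f"Principles:\n{rubric}\n\nPrompt: {prompt}\nResponse: {response}"
    )
    reply = judge_llm(judge_prompt)
    match = re.search(r"Overall:\s*(\d+(?:\.\d+)?)\s*/\s*10", reply)
    if match is None:
        return 0.0  # unparseable judge output earns no reward
    return float(match.group(1)) / 10.0

# Toy judge returning a fixed verdict, for demonstration only.
toy_judge = lambda _prompt: "1. fluent\n2. warm\nOverall: 8/10"
reward = principle_guided_reward(
    "Write a New Year blessing.", "May fortune find you.",
    ["Be novel", "Be warm"], toy_judge,
)
```

In an RLAIF loop, this scalar would be fed to the policy optimizer as the reward signal, replacing a learned reward model.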