🤖 AI Summary
To address the limitations of small language models (SLMs) in Chinese creative writing and the prohibitive deployment costs of large language models (LLMs), this paper proposes a principle-guided LLM-as-a-Judge framework for high-quality greeting generation with minimal reliance on human annotation. Methodologically, two AI-driven reward strategies are compared within a Reinforcement Learning from AI Feedback (RLAIF) framework: (1) a multi-agent rejection sampling mechanism curates preference data on which a reward model is trained; (2) a principle-aligned LLM serves as an interpretable judge that supplies reward signals directly, with its reward function refined via adversarial training combined with a reflection mechanism. Experiments demonstrate substantial improvements over baselines, with the LLM-as-a-Judge strategy yielding superior generation quality, training efficiency, and scalability. Automatic evaluation metrics correlate strongly with human judgments (Spearman's ρ > 0.92), validating their reliability. This work establishes a novel paradigm for resource-constrained creative text generation.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite creative writing in a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a reward model (RM) trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to provide reward signals directly. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.
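The rejection-sampling step described above (sample several candidates, score them, keep the best and worst as a preference pair) can be sketched as follows. The `score_fn` here is a hypothetical stand-in for the paper's multi-agent judge; in practice it would be an LLM call scoring each greeting against the creative-writing principles, and the helper name `build_preference_pair` is our own, not from the paper.

```python
def build_preference_pair(candidates, score_fn):
    """Rejection-sampling style pairing: rank sampled candidates by a
    scoring function and keep the top as 'chosen' and the bottom as
    'rejected', yielding one RM training example."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return {"chosen": ranked[0], "rejected": ranked[-1]}

def toy_score(text):
    """Toy stand-in scorer: rewards lexical variety (unique-word ratio
    times length). A real judge would assess creativity and fluency."""
    words = text.split()
    return len(set(words)) / len(words) * len(words)

# Illustrative candidate greetings sampled from a policy model:
candidates = [
    "Wishing you joy, luck, and a year brighter than the last!",
    "Happy new year happy new year happy new year",
    "May every lantern you light carry a wish that comes true.",
]
pair = build_preference_pair(candidates, toy_score)
print(pair["chosen"])    # highest-scored candidate
print(pair["rejected"])  # lowest-scored candidate
```

Repeating this over many prompts yields the preference dataset on which the RM is trained; the LLM-as-a-Judge strategy skips this dataset entirely by emitting the score as a reward at RL time.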