🤖 AI Summary
This study investigates the constrained generation capability of large language models (LLMs) in composing *Song Ci*, a classical Chinese poetic form governed by strict structural, tonal (level/oblique tones), and rhyming constraints. To address this challenge, we propose a Generate-Critic framework: a multi-dimensional automatic evaluator serves as a critic, providing fine-grained feedback signals to guide generation. We systematically explore five prompting strategies (zero-shot, one-shot, completion-based, instruction-tuned, and chain-of-thought) and apply supervised fine-tuning to three lightweight open-source models. Comprehensive evaluation across 18 mainstream LLMs demonstrates that feedback-driven fine-tuning improves formal conformity by up to 5.88%. Furthermore, we introduce a holistic evaluation framework for *Song Ci* generation, covering structural integrity, tonal patterns, and rhyme adherence.
📝 Abstract
This paper presents a systematic investigation into the constrained generation capabilities of large language models (LLMs) in producing *Song Ci*, a classical Chinese poetic form characterized by strict structural, tonal, and rhyme constraints defined by Cipai templates. We first develop a comprehensive, multi-faceted evaluation framework that includes: (i) a formal conformity score, (ii) automated quality assessment using LLMs, (iii) human evaluation, and (iv) classification-based probing tasks. Using this framework, we evaluate the generative performance of 18 LLMs, including 3 proprietary models and 15 open-source models across four families, under five prompting strategies: zero-shot, one-shot, completion-based, instruction-tuned, and chain-of-thought. Finally, we propose a Generate-Critic architecture in which the evaluation framework functions as an automated critic. Leveraging the critic's feedback as a reward signal, we fine-tune three lightweight open-source LLMs via supervised fine-tuning (SFT), resulting in improvements of up to 5.88% in formal conformity. Our findings offer new insights into the generative strengths and limitations of LLMs in producing culturally significant and formally constrained literary texts.
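To make the "formal conformity score" concrete, here is a minimal rule-based sketch of the kind of check a Cipai-template critic could perform. The template format, the `tone_of` and `rhyme_class_of` lookups, and the equal weighting of checks are all illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a rule-based formal conformity scorer for Song Ci.
# A Cipai template is assumed to specify per-line character counts, a
# level/oblique tonal pattern ("P" = level, "Z" = oblique, "*" = free),
# and which line indices must share a rhyme.

def conformity_score(lines, template, tone_of, rhyme_class_of):
    """Score a Song Ci draft against a Cipai template; returns a value in [0, 1].

    lines          -- list of strings, one line of the ci per entry
    template       -- dict with keys "lengths", "tones", "rhyme_lines"
    tone_of        -- maps a character to "P" or "Z" (stand-in for a tone dictionary)
    rhyme_class_of -- maps a character to a rhyme-group identifier
    """
    checks, passed = 0, 0

    # 1. Structural check: line count and per-line character counts match.
    checks += 1
    if [len(line) for line in lines] == template["lengths"]:
        passed += 1

    # 2. Tonal check: every constrained position must match the P/Z pattern.
    for line, pattern in zip(lines, template["tones"]):
        for ch, want in zip(line, pattern):
            if want == "*":
                continue
            checks += 1
            if tone_of(ch) == want:
                passed += 1

    # 3. Rhyme check: final characters of rhyming lines share one rhyme class.
    finals = [lines[i][-1] for i in template["rhyme_lines"] if lines[i]]
    checks += 1
    if len({rhyme_class_of(c) for c in finals}) == 1:
        passed += 1

    return passed / checks if checks else 0.0
```

A score like this is cheap to compute per draft, so it can serve both as an evaluation metric and as the critic's feedback signal when filtering or ranking candidate generations for fine-tuning.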