When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of a large-scale dataset and of effective generation methods for long-form literary review within the TTCW (Torrance Test of Creative Writing) creativity framework. The authors present a dataset of 263,911 long-form stories, each annotated with scalar scores across 14 TTCW-based dimensions and accompanied by meta-synthesised review comments. Using Qwen3-4B and Qwen3-8B, they compare fine-tuning with and without reasoning content. Non-reasoning fine-tuning gives stronger and more stable performance on fixed-format review generation, with the best setting reaching an evaluation score of 0.6820. Reasoning-supervised models, by contrast, are more prone to parse failures, often drifting into irrelevant or repetitive reasoning-style text instead of completing the 14-metric report. The results suggest that reasoning supervision is not straightforwardly beneficial for fixed-format, rubric-based review generation.

📝 Abstract

Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an evaluation score of 0.6820. Further analysis shows that reasoning-supervised models are more prone to parse failures, often continuing with irrelevant or repetitive reasoning-style text rather than completing the required 14-metric review report. These results suggest that, for fixed-format rubric-based review generation, reasoning supervision is not straightforwardly beneficial, and precise metric-aligned scoring remains challenging even after task-specific fine-tuning.
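
To make the notion of a "parse failure" concrete, the following is a minimal sketch of how a fixed-format 14-metric report might be checked. The report layout (one `name: score` line per dimension), the regular expression, and the helper names `parse_review` and `parse_failure_rate` are all assumptions made for illustration; the abstract does not specify the actual output format, dimension names, or how the evaluation score is computed.

```python
import re

# Assumed layout: one "<dimension name>: <numeric score>" line per dimension.
# The paper's real report format is not given in this summary.
NUM_METRICS = 14
LINE_RE = re.compile(
    r"^[ \t]*(?P<name>[^:\n]+?)[ \t]*:[ \t]*(?P<score>\d+(?:\.\d+)?)[ \t]*$",
    re.MULTILINE,
)


def parse_review(text, expected=NUM_METRICS):
    """Return {dimension: score} if exactly `expected` distinct dimensions
    are found in `text`; otherwise return None (a parse failure)."""
    scores = {}
    for match in LINE_RE.finditer(text):
        # Keep the first score seen for each dimension name.
        scores.setdefault(match.group("name"), float(match.group("score")))
    return scores if len(scores) == expected else None


def parse_failure_rate(outputs):
    """Fraction of model outputs that do not yield a complete report."""
    if not outputs:
        return 0.0
    failed = sum(1 for out in outputs if parse_review(out) is None)
    return failed / len(outputs)


# Hypothetical outputs: one complete report, one that trails off into
# reasoning-style text before all 14 dimensions are scored.
complete = "\n".join(f"dim_{i}: {i % 5 + 1}" for i in range(14))
truncated = (
    "\n".join(f"dim_{i}: {i % 5 + 1}" for i in range(5))
    + "\nLet me think about the story again and again"
)
rate = parse_failure_rate([complete, truncated])
```

Under these assumptions, an output that stops scoring partway through and continues with free-form text is counted as a failure because fewer than 14 dimensions can be recovered from it.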

Problem

Research questions and friction points this paper is trying to address.

TTCW

long-form literary review

creativity evaluation

LLM-as-Judge

rubric-based generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

TTCW-based evaluation

long-form literary review generation

reasoning supervision