🤖 AI Summary
This study systematically evaluates the capabilities and limitations of large language models (LLMs) in providing feedback on creative writing, specifically their ability to identify core writing issues and to balance critical with encouraging commentary. To this end, we construct the first controlled, fully annotated test set of 1,300 problematic short stories and introduce the novel task of "writing problem prioritization." We propose a structured error-injection methodology that generates diverse, precisely localized writing flaws. Our evaluation framework integrates automated metrics and human assessment across four dimensions: accuracy, specificity, affective balance, and actionability. Results show that while current LLMs generate relatively concrete and accurate feedback, they consistently struggle to distinguish superficial stylistic flaws from deep-seated narrative deficiencies and fail to modulate affective tone, often over-criticizing or offering only vague encouragement. This work establishes a benchmark dataset, a rigorous evaluation paradigm, and concrete directions for advancing LLM-based writing assistance systems.
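The error-injection setup described above can be pictured with a minimal sketch: start from a clean story, rewrite one localized span so it contains a known flaw, and keep the flaw type and location as the gold annotation for the writing problem prioritization task. The flaw taxonomy, the names `inject_flaw` and `rewrite_fn`, and the toy rewrite step below are illustrative assumptions, not the paper's actual pipeline.

```python
import random
from dataclasses import dataclass

# Illustrative flaw taxonomy; the paper's actual categories may differ.
FLAW_TYPES = ["cliched_phrasing", "tense_inconsistency", "flat_character", "plot_hole"]

@dataclass
class CorruptedStory:
    text: str            # story with one injected writing flaw
    flaw_type: str       # which kind of flaw was injected
    flaw_paragraph: int  # index of the paragraph that was corrupted

def inject_flaw(story: str, flaw_type: str, rewrite_fn) -> CorruptedStory:
    """Corrupt one paragraph of a clean story and record the gold annotation.

    `rewrite_fn(paragraph, flaw_type)` stands in for whatever rewriting step
    (e.g., a prompted LLM) introduces the requested flaw into the text.
    """
    paragraphs = [p for p in story.split("\n\n") if p.strip()]
    target = random.randrange(len(paragraphs))
    paragraphs[target] = rewrite_fn(paragraphs[target], flaw_type)
    return CorruptedStory(
        text="\n\n".join(paragraphs),
        flaw_type=flaw_type,
        flaw_paragraph=target,
    )

if __name__ == "__main__":
    # Toy rewrite step: appends a clichéd sentence instead of calling a model.
    def toy_rewrite(paragraph, flaw_type):
        return paragraph + " It was a dark and stormy night, needless to say."

    story = "Mara packed her bag before dawn.\n\nBy noon she had crossed the ridge."
    corrupted = inject_flaw(story, "cliched_phrasing", toy_rewrite)
    print(corrupted.flaw_type, corrupted.flaw_paragraph)
```

Keeping the flaw type and location as explicit labels is what allows accuracy on the prioritization task to be scored automatically, alongside the human assessment of the feedback itself.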
📝 Abstract
Can LLMs support creative writers by giving meaningful writing feedback? In this paper, we explore the challenges and limitations of model-generated writing feedback by defining a new task, dataset, and evaluation framework. To study model performance in a controlled manner, we present a novel test set of 1,300 stories that we intentionally corrupted to introduce writing issues. We study the performance of commonly used LLMs on this task with both automatic and human evaluation metrics. Our analysis shows that current models have strong out-of-the-box behavior in many respects, providing specific and mostly accurate writing feedback. However, models often fail to identify the biggest writing issue in a story and to correctly decide when to offer critical vs. positive feedback.