Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods struggle to distinguish pedagogically justified creative rewriting from factual hallucination in AI-generated scientific narratives. To address this, the authors propose StoryScore, an evaluation framework that, for the first time, treats narrative control and pedagogical creativity as explicit assessment dimensions. StoryScore integrates multiple metrics, including semantic alignment, lexical grounding, structural fidelity, redundancy avoidance, and entity-level hallucination detection, into a unified automatic scoring mechanism. This lets it separate legitimate creative adaptations from factual inaccuracies in scientific storytelling and improves evaluation stability and applicability. The framework also exposes critical limitations of current hallucination detection techniques when they are applied to contexts involving creative expression.

📝 Abstract
Generative AI can turn scientific articles into narratives for diverse audiences, but evaluating these stories remains challenging. Storytelling demands abstraction, simplification, and pedagogical creativity, qualities that are often not well captured by standard summarization metrics. Meanwhile, factual hallucinations are critical in scientific contexts, yet detectors often misclassify legitimate narrative reformulations or prove unstable when creativity is involved. In this work, we propose StoryScore, a composite metric for evaluating AI-generated scientific stories. StoryScore integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection into a unified framework. Our analysis also reveals why many hallucination detection methods fail to distinguish pedagogical creativity from factual errors, highlighting a key limitation: while automatic metrics can effectively assess semantic similarity to the original content, they struggle to evaluate how that content is narrated and controlled.
Problem

Research questions and friction points this paper is trying to address.

hallucination
creativity
scientific storytelling
evaluation metric
AI-generated content
Innovation

Methods, ideas, or system contributions that make the work stand out.

StoryScore
scientific storytelling
hallucination detection
narrative evaluation
pedagogical creativity