SAGE: Hierarchical LLM-Based Literary Evaluation through Ontology-Grounded Interpretive Dimensions

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of evaluating literary quality, which depends on interpretive dimensions that resist quantification, such as cultural representation, emotional depth, and philosophical complexity. The authors propose SAGE, a hierarchical evaluation framework that combines theory-driven ontological constructs with large language models (LLMs). Using multi-round iterative reflection, dual-mode assessment (content-based and metadata-based), and independent validation, SAGE scores texts on three layers: cultural, emotional-psychological, and existential-philosophical. Across 600 evaluations of 100 short stories (50 canonical, 30 pulp fiction, 20 LLM-generated), the framework reaches 98.8% score convergence and greater than 94% inter-rater agreement, and it ranks the genres in a consistent order (Canonical > Pulp > LLM, all p<0.001). Cross-layer correlations (r=0.649–0.683) indicate that the three layers capture empirically distinguishable quality facets. The largest gaps for LLM-generated texts appear in cultural critique and philosophical depth (Cohen's d>2.4), with a smaller gap in emotional representation (d=1.68).

📝 Abstract

Evaluating literary quality requires assessing interpretive dimensions such as cultural representation, emotional depth, and philosophical sophistication that resist straightforward computational measurement. We introduce SAGE, a hierarchical evaluation framework that decomposes literary quality into ontology-grounded interpretive dimensions assessed through structured large language model evaluation with multi-round iterative reflection and independent validation. We validate the framework on 100 short stories (50 canonical works, 30 pulp fiction, 20 LLM-generated narratives) across three analytical layers (cultural, emotional-psychological, existential-philosophical) using dual-mode assessment. Across 600 evaluations, the framework achieves 98.8% score convergence and greater than 94% inter-rater agreement, with near-perfect mode invariance between content-based and metadata-based evaluation. Statistical analysis reveals a consistent genre hierarchy (Canonical > Pulp > LLM, all p<0.001) with layer-specific discrimination: cultural critique and philosophical depth exhibit very large effect sizes (Cohen's d>2.4), while emotional representation shows smaller gaps (d=1.68), suggesting that affective patterns are more learnable from training data than critical stance or philosophical depth. Cross-layer correlations (r=0.649-0.683) confirm the three dimensions capture empirically distinguishable quality facets. These findings demonstrate that theory-driven LLM evaluation can achieve measurement-grade reliability and support systematic identification of where current generative models fall short of human literary production, with direct implications for scalable automated evaluation of open-ended text generation.
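The evaluation design and the effect-size statistic named in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration, not the authors' code: the function names `evaluation_grid` and `cohens_d` are hypothetical, the toy scores are invented, and the abstract does not say which variant of Cohen's d was used (the pooled-standard-deviation form is assumed). Crossing 100 stories with three layers and two modes gives 600 items, which would be consistent with the 600 evaluations reported.

```python
from itertools import product
from math import sqrt
from statistics import mean, variance

# Corpus composition and design factors as stated in the abstract.
GENRE_COUNTS = {"canonical": 50, "pulp": 30, "llm": 20}
LAYERS = ("cultural", "emotional-psychological", "existential-philosophical")
MODES = ("content", "metadata")


def evaluation_grid(genre_counts, layers, modes):
    """Enumerate every (story, layer, mode) combination (hypothetical helper)."""
    story_ids = [
        f"{genre}-{i}" for genre, n in genre_counts.items() for i in range(n)
    ]
    return list(product(story_ids, layers, modes))


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


grid = evaluation_grid(GENRE_COUNTS, LAYERS, MODES)
print(len(grid))  # 600

# Toy scores, invented purely to exercise the formula.
print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```

By the conventional benchmarks for Cohen's d (0.8 is already considered large), the reported values of d>2.4 and d=1.68 are both large effects. The abstract's contrast is therefore between a very large gap and a merely large one.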

Problem

Research questions and friction points this paper is trying to address.

literary evaluation

interpretive dimensions

computational measurement

large language models

ontology

Innovation

Methods, ideas, or system contributions that make the work stand out.

ontology-grounded evaluation

hierarchical LLM assessment

iterative reflection

literary quality dimensions

measurement-grade reliability

🔎 Similar Papers

LLM Reading Tea Leaves: Automatically Evaluating Topic Models with Large Language Models

2024-06-13 · arXiv.org · Citations: 1