SAGE: Hierarchical LLM-Based Literary Evaluation through Ontology-Grounded Interpretive Dimensions

📅 2026-05-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the longstanding challenge of evaluating literary quality, a domain whose key dimensions (cultural representation, emotional depth, and philosophical sophistication) resist straightforward computational measurement. The authors propose SAGE, a hierarchical evaluation framework that combines theory-driven, ontology-grounded constructs with structured large language model (LLM) evaluation. Using multi-round iterative reflection, dual-mode assessment (content-based and metadata-based), and independent validation, SAGE scores texts on three layers: cultural, emotional-psychological, and existential-philosophical. Across 600 evaluations of 100 short stories, the framework reaches 98.8% score convergence and greater than 94% inter-rater agreement, and it separates canonical, pulp, and LLM-generated texts in a consistent hierarchy (Canonical > Pulp > LLM, all p<0.001). Cross-layer correlations (r=0.649-0.683) indicate that the three layers capture empirically distinguishable facets of quality. Layer-specific effect sizes are largest for cultural critique and philosophical depth (Cohen's d>2.4) and smaller for emotional representation (d=1.68), pointing to systematic shortfalls of generated texts in critical stance and philosophical depth.
πŸ“ Abstract
Evaluating literary quality requires assessing interpretive dimensions such as cultural representation, emotional depth, and philosophical sophistication that resist straightforward computational measurement. We introduce SAGE, a hierarchical evaluation framework that decomposes literary quality into ontology-grounded interpretive dimensions assessed through structured large language model evaluation with multi-round iterative reflection and independent validation. We validate the framework on 100 short stories (50 canonical works, 30 pulp fiction, 20 LLM-generated narratives) across three analytical layers (cultural, emotional-psychological, existential-philosophical) using dual-mode assessment. Across 600 evaluations, the framework achieves 98.8% score convergence and greater than 94% inter-rater agreement, with near-perfect mode invariance between content-based and metadata-based evaluation. Statistical analysis reveals a consistent genre hierarchy (Canonical > Pulp > LLM, all p<0.001) with layer-specific discrimination: cultural critique and philosophical depth exhibit very large effect sizes (Cohen's d>2.4), while emotional representation shows smaller gaps (d=1.68), suggesting that affective patterns are more learnable from training data than critical stance or philosophical depth. Cross-layer correlations (r=0.649-0.683) confirm the three dimensions capture empirically distinguishable quality facets. These findings demonstrate that theory-driven LLM evaluation can achieve measurement-grade reliability and support systematic identification of where current generative models fall short of human literary production, with direct implications for scalable automated evaluation of open-ended text generation.
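The effect sizes quoted in the abstract are Cohen's d values. As a minimal sketch of how such a value is computed, the snippet below uses the pooled-standard-deviation form of Cohen's d. The abstract does not say which variant the authors used, so the pooled form is an assumption, and the scores are hypothetical toy numbers chosen for illustration, not data from the paper.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    # Standardized difference between the two group means.
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical toy scores for one layer; NOT taken from the paper.
canonical_scores = [8.0, 9.0, 7.0, 8.0]
generated_scores = [6.0, 7.0, 5.0, 6.0]

d = cohens_d(canonical_scores, generated_scores)
print(f"Cohen's d = {d:.3f}")  # prints "Cohen's d = 2.449"
```

In this toy case the group means differ by 2 points and the pooled standard deviation is about 0.82, so the gap is roughly 2.4 standard deviations, the same order of magnitude the abstract describes as a very large effect.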
Problem

Research questions and friction points this paper is trying to address.

literary evaluation
interpretive dimensions
computational measurement
large language models
ontology
Innovation

Methods, ideas, or system contributions that make the work stand out.

ontology-grounded evaluation
hierarchical LLM assessment
iterative reflection
literary quality dimensions
measurement-grade reliability