🤖 AI Summary
This study systematically evaluates the narrative understanding and analysis capabilities of GPT-3.5, PaLM2, and Llama2—representing leading closed- and open-weight LLMs—under controlled conditions. Method: We employ standardized prompt engineering to isolate model-specific behaviors, conduct cross-model response comparison, and introduce a novel four-dimensional human evaluation framework—assessing consistency, logical coherence, richness, and stance neutrality—with expert annotations serving as the gold standard for fair, reproducible, multi-dimensional quantification. Contribution/Results: Results reveal significant inter-model response disparities under identical prompts, reflecting fundamental differences in underlying reasoning mechanisms and semantic modeling capacities. The work not only exposes a pronounced performance gap between state-of-the-art open and closed LLMs in narrative intelligence but also establishes the first standardized, task-specific evaluation paradigm for narrative capability. This provides a transferable methodological benchmark for fine-grained, capability-oriented LLM assessment.
📝 Abstract
In this paper, we conducted a Multi-Perspective Comparative Narrative Analysis (CNA) on three prominent LLMs: GPT-3.5, PaLM2, and Llama2. We applied identical prompts and evaluated their outputs on specific tasks, ensuring an equitable and unbiased comparison between various LLMs. Our study revealed that the three LLMs generated divergent responses to the same prompt, indicating notable discrepancies in their ability to comprehend and analyze the given task. Human evaluation was used as the gold standard, evaluating four perspectives to analyze differences in LLM performance.