NARRA-Gym for Evaluating Interactive Narrative Agents

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing benchmarks cannot evaluate whether large language models maintain coherent, evolving narratives across multi-turn interactions while adapting to a user. This work introduces NARRA-Gym, an executable environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model-in-the-loop trajectory for end-to-end, fine-grained assessment. The framework covers five capabilities: story generation, long-context state management and pacing, character simulation, empathic personalization, and story-grounded artifact synthesis. Evaluation combines a controlled LLM-as-judge sweep over eight benchmark personas with a human evaluation in which participants rate customized model outputs. Experiments on nine frontier LLMs show substantial variation across models, personas, and evaluation dimensions: models that produce fluent stories can still fail on robustness, user experience, or resistance-sensitive personalization. The results suggest that interactive narrative is a useful benchmark for long-horizon, user-adaptive LLM behavior.
πŸ“ Abstract
Interactive narrative tasks require LLMs to sustain a coherent, evolving story while adapting to a user over multiple turns. However, suitable benchmarks for this setting are limited: existing evaluations often focus on static prompts, isolated story generations, or post-hoc ratings, and therefore miss whether models can jointly manage story generation, long-context state and pacing, character simulation, empathic personalization, and story-grounded artifacts. We introduce NARRA-Gym, an executable evaluation environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model-in-the-loop trajectory, including story construction, memory updates, planning, pacing interventions, and optional artifact synthesis. We evaluate nine frontier LLMs using a controlled LLM-as-judge sweep over eight benchmark personas and a human evaluation in which participants rate customized model outputs. Our results show substantial variation across models, personas, and evaluation dimensions: models that produce fluent stories can still fail on robustness, user experience, or resistance-sensitive personalization. These findings suggest that interactive narrative offers a useful benchmark for evaluating long-horizon, user-adaptive LLM behavior beyond isolated story quality.
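The episode structure described in the abstract (a seed, a multi-turn model-in-the-loop exchange, and a logged trajectory) can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than NARRA-Gym's actual interface: the names `TrajectoryEvent`, `Episode`, and `run_episode`, the two stage labels, and the way memory is folded into the context are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrajectoryEvent:
    """One logged step of an episode (hypothetical schema)."""
    turn: int
    stage: str    # e.g. "story" or "memory"; labels are assumptions
    content: str


@dataclass
class Episode:
    """An interactive story episode grown from a sparse seed."""
    seed: str
    events: List[TrajectoryEvent] = field(default_factory=list)

    def log(self, turn: int, stage: str, content: str) -> None:
        self.events.append(TrajectoryEvent(turn, stage, content))


def run_episode(
    seed: str,
    model: Callable[[str], str],   # stand-in for the LLM under test
    user: Callable[[str], str],    # stand-in for a persona or participant
    max_turns: int = 3,
) -> Episode:
    """Alternate model and user turns, logging every step."""
    episode = Episode(seed=seed)
    memory: List[str] = []
    for turn in range(max_turns):
        # The context is the seed plus all user replies seen so far.
        context = " | ".join([seed] + memory)
        story = model(context)
        episode.log(turn, "story", story)
        # The user's reply becomes part of the state for later turns.
        reply = user(story)
        memory.append(reply)
        episode.log(turn, "memory", reply)
    return episode
```

A judge or human rater would then score the returned `Episode.events` rather than a single generated story, which is the distinction the abstract draws between trajectory-level evaluation and post-hoc ratings of isolated outputs.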
Problem

Research questions and friction points this paper is trying to address.

interactive narrative
LLM evaluation
long-context reasoning
user-adaptive behavior
story coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive narrative
NARRA-Gym
model-in-the-loop evaluation
empathic personalization
long-context storytelling
Yue Huang
PhD student, University of Notre Dame
trustworthy AI, generative model, machine learning, AI for science
Yuchen Ma
LMU Munich
Deep Learning, Causal Inference, Diffusion Models, Foundation Models
Jiayi Ye
Master's student at ShanghaiTech University
Embodied AI, Computer Vision
Wenjie Wang
University of Notre Dame
Zipeng Ling
University of Pennsylvania
Xingjian Hu
Lehigh University
Yuexing Hao
Research Fellow
Human Computer Interaction, Health Intelligence
Zichen Chen
UC Santa Barbara
Agentic LLM, Trustworthy AI, AI Safety, Synthetic Data
Zhangchen Xu
University of Washington
(^._.^)ﾉ, Synthetic Data, Post-Training, Safety, Federated Learning
Yunhong He
University of Notre Dame
Zhengqing Yuan
PhD student, University of Notre Dame
NLP, Deep learning, CV
Yujun Zhou
University of Notre Dame
Trustworthy LLM, LLM Reasoning, Adversarial Machine Learning
Kehan Guo
University of Notre Dame
LLM, Machine Reasoning, Generative Models, XAI, AI for Science
Chaoran Chen
University of Notre Dame
Human-Computer Interaction, Human-AI Collaboration, Usable Privacy and Security, LLM Agent
Toby Jia-Jun Li
Assistant Professor, University of Notre Dame
Human-Computer Interaction, Human-AI Collaboration, End User Programming, Programming by Demonstration, Future of Work
Stefan Feuerriegel
Professor, LMU Munich
AI in Management, Business Analytics, Computational Social Science, AI for Good, Causal ML
Xiangliang Zhang
Leonard C. Bettex Collegiate Professor, Computer Science and Engineering, University of Notre Dame
Machine Learning, AI for Science