NARRA-Gym for Evaluating Interactive Narrative Agents

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks rarely test whether large language models can sustain a coherent, evolving narrative over multi-turn interaction while adapting to the user. This work proposes NARRA-Gym, an executable environment that turns a sparse emotional seed into a complete interactive story episode and supports end-to-end, fine-grained assessment. The framework covers five dimensions: story generation, long-context state management and pacing, character simulation, empathic personalization, and story-grounded artifact synthesis. Built on a model-in-the-loop architecture, it pairs a controlled LLM-as-judge sweep over eight benchmark personas with a human evaluation in which participants rate customized model outputs. Experiments on nine frontier large language models reveal substantial differences in robustness, user experience, and resistance-sensitive personalization, which suggests that interactive narrative is a useful benchmark for long-horizon, user-adaptive storytelling.

📝 Abstract

Interactive narrative tasks require LLMs to sustain a coherent, evolving story while adapting to a user over multiple turns. However, suitable benchmarks for this setting are limited: existing evaluations often focus on static prompts, isolated story generations, or post-hoc ratings, and therefore miss whether models can jointly manage story generation, long-context state and pacing, character simulation, empathic personalization, and story-grounded artifacts. We introduce NARRA-Gym, an executable evaluation environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model-in-the-loop trajectory, including story construction, memory updates, planning, pacing interventions, and optional artifact synthesis. We evaluate nine frontier LLMs using a controlled LLM-as-judge sweep over eight benchmark personas and a human evaluation in which participants rate customized model outputs. Our results show substantial variation across models, personas, and evaluation dimensions: models that produce fluent stories can still fail on robustness, user experience, or resistance-sensitive personalization. These findings suggest that interactive narrative offers a useful benchmark for evaluating long-horizon, user-adaptive LLM behavior beyond isolated story quality.
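
As a rough illustration of what "logging the full model-in-the-loop trajectory" could look like, the sketch below runs one episode from an emotional seed and records an event for each stage named in the abstract. It is not NARRA-Gym's actual code or API: every name here (`Trajectory`, `Event`, `run_episode`, the prompt strings, the rule that a pacing intervention fires on the final turn) is a hypothetical assumption made for illustration, and the model and user are stub callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical event kinds, named after the trajectory stages listed in the abstract.
EVENT_KINDS = (
    "story_construction",
    "memory_update",
    "planning",
    "pacing_intervention",
    "artifact_synthesis",
)


@dataclass
class Event:
    turn: int
    kind: str
    payload: str


@dataclass
class Trajectory:
    seed: str
    persona: str
    events: List[Event] = field(default_factory=list)

    def log(self, turn: int, kind: str, payload: str) -> None:
        if kind not in EVENT_KINDS:
            raise ValueError(f"unknown event kind: {kind}")
        self.events.append(Event(turn, kind, payload))


def run_episode(
    seed: str,
    persona: str,
    model: Callable[[str], str],   # assumed interface: prompt in, text out
    user: Callable[[str], str],    # assumed interface: story text in, reply out
    max_turns: int = 3,
    make_artifact: bool = False,
) -> Trajectory:
    """Run one illustrative episode and return its logged trajectory."""
    traj = Trajectory(seed, persona)
    memory: List[str] = []

    story = model(f"Open a story from this emotional seed: {seed}")
    traj.log(0, "story_construction", story)

    for turn in range(1, max_turns + 1):
        reply = user(story)
        memory.append(reply)
        traj.log(turn, "memory_update", reply)

        plan = model(f"Plan the next beat given memory: {memory}")
        traj.log(turn, "planning", plan)

        # Assumed pacing rule for this sketch: steer toward an ending on the last turn.
        if turn == max_turns:
            traj.log(turn, "pacing_intervention", "wrap up: final turn reached")

        story = model(f"Continue the story following plan: {plan}")
        traj.log(turn, "story_construction", story)

    if make_artifact:
        artifact = model("Summarize the story as a keepsake")
        traj.log(max_turns, "artifact_synthesis", artifact)
    return traj


# Usage with stub callables standing in for a real LLM and a simulated persona.
stub_model = lambda prompt: "MODEL: " + prompt[:20]
stub_user = lambda story: "I feel unsure."
episode = run_episode("grief after a move", "reluctant teen", stub_model, stub_user,
                      max_turns=2, make_artifact=True)
for event in episode.events:
    print(event.turn, event.kind)
```

Keeping the log as a flat list of typed events is one possible design choice; it would let a judge model or a human rater score individual stages (for example, only pacing interventions) without re-running the episode.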

Problem

Research questions and friction points this paper is trying to address.

interactive narrative

LLM evaluation

long-context reasoning

user-adaptive behavior

story coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive narrative

NARRA-Gym

model-in-the-loop evaluation

empathic personalization

long-context storytelling

🔎 Similar Papers

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

2024-09-10 · arXiv.org · Citations: 0

Agents' Room: Narrative Generation through Multi-step Collaboration

2024-10-03 · arXiv.org · Citations: 4