FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GUI agent benchmarks lack narrative diversity, which hinders evaluation of agents' ability to complete coherent, multi-step story arcs. This is especially true of adventure games, which demand long-term memory and sequential reasoning and expose a persistent observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to systematically evaluate end-to-end narrative task completion. We propose COAST, an agentic framework that leverages long-term clue memory to plan and solve sequential tasks, and CUA-as-a-Judge, an automated gameplay evaluator. Experiments reveal that state-of-the-art GUI agents perform poorly on full story arcs; COAST substantially improves milestone completion rates yet remains notably below human performance, underscoring the need for improved long-term memory modeling and narrative understanding in GUI agents.

📝 Abstract
GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.
Problem

Research questions and friction points this paper is trying to address.

Evaluating GUI agents on full story arc completion in games
Addressing the observation-behavior gap in sequential gameplay
Benchmarking diverse adventure games with complex narrative interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-term clue memory enabling agents to act on information observed earlier in gameplay
CUA-as-a-Judge, an automated gameplay evaluator for assessing task progress
COAST, an agentic framework that bridges the observation-behavior gap in sequential tasks
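The clue-memory idea above can be illustrated with a minimal sketch. All names here (`ClueMemory`, `run_episode`) are hypothetical and not from the paper; this only shows the general pattern of accumulating clues during play so later actions can draw on them, not COAST's actual implementation.

```python
# Minimal sketch of a clue-memory loop for a sequential GUI agent.
# ClueMemory, run_episode, and the step format are illustrative assumptions,
# not the paper's implementation.

class ClueMemory:
    """Accumulates clues observed earlier so later steps can act on them."""

    def __init__(self):
        self.clues = []

    def remember(self, clue):
        # Store each new, non-empty clue exactly once.
        if clue and clue not in self.clues:
            self.clues.append(clue)

    def recall(self):
        # Return all stored clues as context for the next planning step.
        return list(self.clues)


def run_episode(steps):
    """Toy loop: each step is (observation, optional clue); every action
    is planned with the full set of clues recalled so far."""
    memory = ClueMemory()
    actions = []
    for observation, clue in steps:
        memory.remember(clue)
        # Placeholder for planning: pair the observation with recalled clues.
        actions.append((observation, memory.recall()))
    return actions
```

For example, a safe code noticed early in a game (`("door", "code 4 2")`) remains available when the agent later reaches the keypad, which is exactly the observation-behavior gap the framework targets.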