🤖 AI Summary
Existing GUI agent benchmarks suffer from insufficient narrative diversity, hindering evaluation of agents' ability to complete coherent, multi-step story arcs. This is especially true for adventure games, which demand long-term memory and sequential reasoning and expose a persistent observation-behavior gap. To address this, we introduce FlashAdventure, the first benchmark comprising 34 Flash-based adventure games designed to systematically evaluate end-to-end narrative task completion. We propose COAST, an agent framework integrating long-horizon clue memory, LLM-driven GUI interaction, and stage-wise planning. Additionally, we present CUA-as-a-Judge, an automated method for evaluating gameplay progress. Experiments reveal that state-of-the-art GUI agents perform poorly on full story arcs; COAST substantially improves milestone completion rates yet remains well below human performance, underscoring the need for stronger long-term memory modeling and narrative understanding in GUI agents.
📝 Abstract
GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework that leverages long-term clue memory to better plan and solve sequential tasks. Experiments show that current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and the best-performing agents warrants continued research to narrow this divide.