🤖 AI Summary
This work investigates fundamental limitations of language agents on ultra-long-horizon decision-making tasks spanning up to 100,000 steps. To address the lack of standardized benchmarks, we introduce TextAtari, the first large-scale text-based reinforcement learning benchmark, which converts the visual states of classic Atari games into structured natural-language descriptions across nearly 100 diverse tasks. The conversion builds on AtariARI, an unsupervised representation learning framework that maps visual states to semantically grounded textual abstractions. We systematically evaluate open-weight language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) under three agent frameworks: zero-shot, few-shot chain-of-thought, and reflective reasoning. Results reveal substantial performance gaps relative to human players in state tracking, cross-step reasoning, and strategic planning, exposing core deficiencies in semantic grounding and persistent instruction following. To foster reproducible research, we publicly release the benchmark, evaluation protocols, and baseline implementations, establishing standardized infrastructure for research on long-duration language agents.
📝 Abstract
We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging testbed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios (Basic, Obscured, Manual Augmentation, and Reference-based) investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning.
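To make the pipeline concrete, here is a minimal sketch of what a text-rendered Atari evaluation loop could look like. This is an illustrative assumption, not TextAtari's actual API: `describe_state` stands in for the AtariARI-based text rendering of labeled game variables, and `TextAgent` uses a toy heuristic where a real agent would prompt a language model with the observation.

```python
# Hypothetical sketch of a TextAtari-style step: labeled game variables
# (as AtariARI might expose them) are rendered as text, and an agent
# chooses an action from the textual observation. Names are illustrative.

def describe_state(labels: dict) -> str:
    """Render labeled game variables as a natural-language-style observation."""
    return "; ".join(f"{k} = {v}" for k, v in sorted(labels.items()))

class TextAgent:
    """Stand-in policy; a real agent would send the observation to an LLM."""
    ACTIONS = ("NOOP", "LEFT", "RIGHT", "FIRE")

    def act(self, observation: str) -> str:
        # Toy heuristic in place of an LLM call: move the paddle toward the ball.
        state = dict(part.split(" = ") for part in observation.split("; "))
        if int(state["ball_x"]) < int(state["player_x"]):
            return "LEFT"
        if int(state["ball_x"]) > int(state["player_x"]):
            return "RIGHT"
        return "NOOP"

obs = describe_state({"ball_x": 40, "player_x": 72, "lives": 3})
action = TextAgent().act(obs)
```

In the benchmark's long-horizon setting this loop would run for tens of thousands of steps, which is precisely where the abstract's noted failures in state tracking and sequential reasoning accumulate.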