🤖 AI Summary
This work addresses the limited reasoning capabilities of large language models (LLMs) on long-horizon, structured, context-aware sequential decision-making tasks. To this end, it introduces TALES, a benchmark suite comprising diverse synthetic and human-written text-adventure games. Methodologically, the work distinguishes two orthogonal dimensions of game difficulty and proposes a fine-grained paradigm for evaluating reasoning, integrating multi-model assessment (open- and closed-weight LLMs), behavioral trajectory visualization, and qualitative attribution analysis. Experiments reveal that state-of-the-art LLMs perform strongly on synthetic games yet achieve win rates below 15% on games written for human enjoyment, exposing fundamental limitations in robustness, hierarchical planning, and goal-directed reasoning. TALES enables reproducible, diagnostic evaluation of reasoning and offers an analytical toolkit for advancing LLM reasoning research.
📝 Abstract
Reasoning is an essential skill that enables Large Language Models (LLMs) to interact with the world. As tasks grow more complex, they demand increasingly sophisticated capabilities for sequential decision-making, requiring structured reasoning over the context history to determine the next best action. We introduce TALES, a diverse collection of synthetic and human-written text-adventure games designed to challenge and evaluate a broad range of reasoning capabilities. We present results for a range of LLMs, both open- and closed-weight, and perform a qualitative analysis of the top-performing models. Despite an impressive showing on synthetic games, even the best LLM-driven agents fail to achieve 15% on games designed for human enjoyment. Code and visualizations of the experiments can be found at https://microsoft.github.io/tales.