LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models still struggle to integrate key capabilities, such as planning, state tracking, and long-context processing, in multi-turn, long-horizon agent tasks. To address this, the paper proposes a controllable evaluation environment built on an oracle-based counterfactual framework, combining procedurally generated game tasks with tunable complexity and precise oracle interventions (e.g., perfect planning or error-free state tracking). This setup enables, for the first time, a systematic assessment of each capability's isolated contribution to task success under controlled conditions. Experiments show that planning consistently improves performance, whereas the impact of other skills depends heavily on environment structure and model architecture, offering empirical insights and design guidance for future agent development.

📝 Abstract
Large language models can perform well on many isolated tasks, yet they continue to struggle on multi-turn, long-horizon agentic problems that require skills such as planning, state tracking, and long-context processing. In this work, we aim to better understand the relative importance of advancing these underlying capabilities for success on such tasks. We develop an oracle counterfactual framework for multi-turn problems that asks: how would an agent perform if it could leverage an oracle to perfectly perform a specific task? The change in the agent's performance due to this oracle assistance allows us to measure how critical that skill is to the future advancement of AI agents. We introduce a suite of procedurally generated, game-like tasks with tunable complexity. These controlled environments allow us to provide precise oracle interventions, such as perfect planning or flawless state tracking, and make it possible to isolate the contribution of each oracle without the confounding effects present in real-world benchmarks. Our results show that while some interventions (e.g., planning) consistently improve performance across settings, the usefulness of other skills depends on the properties of the environment and the language model. Our work sheds light on the challenges of multi-turn agentic environments and guides future efforts in the development of AI agents and language models.
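The counterfactual measurement the abstract describes can be sketched in a few lines: run the agent on the same episodes with and without a perfect oracle for one skill, and treat the success-rate gap as that skill's criticality. This is a minimal illustrative sketch, not the paper's code; `run_episode` and its fake outcomes are hypothetical stand-ins for the real agent harness.

```python
# Hypothetical sketch of the oracle counterfactual measurement:
# criticality(skill) = success rate with the oracle - success rate without it.
# run_episode is a stand-in for one procedurally generated game episode;
# a real harness would call the agent and environment instead.

from statistics import mean

def run_episode(use_oracle: bool, seed: int) -> bool:
    """Return True if the episode with this seed ends in task success.
    Outcomes are faked so the sketch is runnable."""
    base_success = (seed % 3 != 0)      # baseline agent succeeds on 2/3 of seeds
    return base_success or use_oracle   # here the oracle rescues every failure

def criticality(n_episodes: int = 30) -> float:
    """Performance delta attributable to the oracle, on matched seeds."""
    baseline = mean(run_episode(False, s) for s in range(n_episodes))
    assisted = mean(run_episode(True, s) for s in range(n_episodes))
    return assisted - baseline
```

In this toy setup the oracle lifts success from 2/3 to 1, so `criticality(30)` is about 0.33; a near-zero delta would suggest the skill is not the bottleneck in that environment.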
Problem

Research questions and friction points this paper is trying to address.

long-horizon
multi-turn
agentic tasks
planning
state tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

oracle counterfactual
long-horizon agentic tasks
procedurally generated environments
multi-turn interaction
capability isolation