AgentLongBench: A Controllable Long-Context Benchmark for Long-Context Agents via Environment Rollouts

📅 2026-01-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing benchmarks, which predominantly focus on static retrieval and fail to evaluate large language model agents' non-linear reasoning and iterative-feedback capabilities in dynamic, long-context environments. The authors propose the first dynamic long-context evaluation framework grounded in an environment rollout mechanism, driven by Lateral Thinking Puzzles that generate interactive trajectories spanning both knowledge-intensive and knowledge-free scenarios. The framework enables systematic assessment of models and their memory systems at context scales from 32K to 4M tokens, introducing the "minimum tokens required for resolution" as a key predictor of performance degradation. Experiments reveal that while state-of-the-art models perform well on static retrieval, they degrade significantly at dynamic information synthesis under high-density tool responses, exposing a critical bottleneck in their reasoning capabilities.
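The rollout mechanism described above can be pictured as a simple judge-agent loop over a hidden-solution puzzle. The sketch below is illustrative only, not the authors' code: the `Puzzle` and `Trajectory` structures, the `agent`/`judge` callables, and the whitespace token count are assumptions standing in for whatever the benchmark actually uses.

```python
# Hypothetical sketch of an environment rollout over a lateral
# thinking puzzle; all names here are illustrative, not from the paper.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Puzzle:
    surface: str   # riddle text shown to the agent
    solution: str  # hidden ground truth held only by the judge

@dataclass
class Trajectory:
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (role, text)

    def n_tokens(self) -> int:
        # Whitespace counting stands in for the model's real tokenizer.
        return sum(len(text.split()) for _, text in self.turns)

def rollout(puzzle: Puzzle,
            agent: Callable[[str, Trajectory], str],
            judge: Callable[[Puzzle, str], str],
            max_tokens: int = 32_000,
            max_turns: int = 200) -> Trajectory:
    """Interleave agent questions and judge feedback until the puzzle
    is solved or the context budget is exhausted."""
    traj = Trajectory()
    for _ in range(max_turns):
        question = agent(puzzle.surface, traj)  # agent sees the full history
        verdict = judge(puzzle, question)       # e.g. "yes", "no", or "solved"
        traj.turns.append(("agent", question))
        traj.turns.append(("judge", verdict))
        if verdict == "solved" or traj.n_tokens() >= max_tokens:
            break
    return traj
```

Scaling `max_tokens` from 32K to 4M while replaying the same puzzles is what would make such a benchmark controllable: the underlying task stays fixed while the context burden grows.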

📝 Abstract
The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non-linear reasoning and iterative feedback. To address this, we introduce \textbf{AgentLongBench}, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios. Experiments with state-of-the-art models and memory systems (32K to 4M tokens) expose a critical weakness: while adept at static retrieval, agents struggle with the dynamic information synthesis essential for agentic workflows. Our analysis indicates that this degradation is driven by the minimum number of tokens required to resolve a query. This factor explains why the high information density inherent in massive tool responses poses a significantly greater challenge than the memory fragmentation typical of long-turn dialogues.
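The abstract's explanatory variable, the minimum number of tokens required to resolve a query, can be made concrete with a small sketch. Everything below is an assumption for illustration: the `evidence` index set is a hypothetical harness annotation marking which turns an oracle reader needs, and whitespace splitting again stands in for a real tokenizer.

```python
from typing import List, Set, Tuple

def min_tokens_to_resolve(turns: List[Tuple[str, str]],
                          evidence: Set[int]) -> int:
    """Token mass of the minimal evidence set for one query;
    `evidence` indexes into the (role, text) interaction history."""
    return sum(len(turns[i][1].split()) for i in evidence)

# A dense tool response packs its evidence into one heavy turn, while a
# long dialogue spreads light turns out, so the metric separates the
# two regimes the abstract contrasts:
dialogue = [("judge", "no"), ("judge", "yes, at night"), ("judge", "no")]
tool_dump = [("tool", "log: night shift; door unlocked; alarm off " * 100)]
print(min_tokens_to_resolve(dialogue, {1}))   # small: a few tokens suffice
print(min_tokens_to_resolve(tool_dump, {0}))  # large: the whole dump is needed
```

On this reading, degradation tracks how many tokens the minimal answer-bearing span occupies rather than the raw length of the conversation, consistent with the abstract's claim that dense tool responses hurt more than fragmented long-turn dialogues.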
Problem

Research questions and friction points this paper is trying to address.

long-context agents
agent-environment interaction
dynamic information synthesis
benchmarking
Lateral Thinking Puzzles
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-context agents
environment rollouts
dynamic information synthesis
Lateral Thinking Puzzles
AgentLongBench
Shicheng Fang
Fudan University
Yuxin Wang
Fudan University
Xiaoran Liu
Fudan University
Natural Language Processing
Jiahao Lu
Fudan University
Chuanyuan Tan
Soochow University
Xinchi Chen
Professor at Fudan University, Shanghai, China
Large Language Models, Embodied AI, Natural Language Processing, Information Retrieval, etc.
Yining Zheng
Fudan University
Xuanjing Huang
Fudan University
Xipeng Qiu
Fudan University