AgentLongBench: A Controllable Long-Context Benchmark for Long-Context Agents via Environment Rollouts

📅 2026-01-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing benchmarks, which predominantly focus on static retrieval and fail to evaluate large language model agents' non-linear reasoning and iterative-feedback capabilities in dynamic, long-context environments. The authors propose the first dynamic long-context evaluation framework grounded in an environment rollout mechanism, driven by Lateral Thinking Puzzles that generate interactive trajectories spanning both knowledge-intensive and knowledge-free scenarios. The framework enables systematic assessment of models and their memory systems at context scales from 32K to 4M tokens, introducing the "minimum tokens required for resolution" as a key predictor of performance degradation. Experiments reveal that while state-of-the-art models perform well on static retrieval, they degrade significantly at dynamic information synthesis under high-density tool responses, exposing a critical bottleneck in their reasoning capabilities.
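The rollout mechanism described above can be pictured as a simple judge-agent loop over a hidden-solution puzzle. The sketch below is illustrative only, not the authors' code: the `Puzzle` and `Trajectory` structures, the `agent`/`judge` callables, and the whitespace token count are assumptions standing in for whatever the benchmark actually uses.

```python
# Hypothetical sketch of an environment rollout over a lateral
# thinking puzzle; all names here are illustrative, not from the paper.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Puzzle:
    surface: str   # riddle text shown to the agent
    solution: str  # hidden ground truth held only by the judge

@dataclass
class Trajectory:
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (role, text)

    def n_tokens(self) -> int:
        # Whitespace counting stands in for the model's real tokenizer.
        return sum(len(text.split()) for _, text in self.turns)

def rollout(puzzle: Puzzle,
            agent: Callable[[str, Trajectory], str],
            judge: Callable[[Puzzle, str], str],
            max_tokens: int = 32_000,
            max_turns: int = 200) -> Trajectory:
    """Interleave agent questions and judge feedback until the puzzle
    is solved or the context budget is exhausted."""
    traj = Trajectory()
    for _ in range(max_turns):
        question = agent(puzzle.surface, traj)  # agent sees the full history
        verdict = judge(puzzle, question)       # e.g. "yes", "no", or "solved"
        traj.turns.append(("agent", question))
        traj.turns.append(("judge", verdict))
        if verdict == "solved" or traj.n_tokens() >= max_tokens:
            break
    return traj
```

Scaling `max_tokens` from 32K to 4M while replaying the same puzzles is what would make such a benchmark controllable: the underlying task stays fixed while the context burden grows.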

📝 Abstract
The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non-linear reasoning and iterative feedback. To address this, we introduce \textbf{AgentLongBench}, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios. Experiments with state-of-the-art models and memory systems (32K to 4M tokens) expose a critical weakness: while adept at static retrieval, agents struggle with the dynamic information synthesis essential for agentic workflows. Our analysis indicates that this degradation is driven by the minimum number of tokens required to resolve a query. This factor explains why the high information density inherent in massive tool responses poses a significantly greater challenge than the memory fragmentation typical of long-turn dialogues.
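The abstract's explanatory variable, the minimum number of tokens required to resolve a query, can be made concrete with a small sketch. Everything below is an assumption for illustration: the `evidence` index set is a hypothetical harness annotation marking which turns an oracle reader needs, and whitespace splitting again stands in for a real tokenizer.

```python
from typing import List, Set, Tuple

def min_tokens_to_resolve(turns: List[Tuple[str, str]],
                          evidence: Set[int]) -> int:
    """Token mass of the minimal evidence set for one query;
    `evidence` indexes into the (role, text) interaction history."""
    return sum(len(turns[i][1].split()) for i in evidence)

# A dense tool response packs its evidence into one heavy turn, while a
# long dialogue spreads light turns out, so the metric separates the
# two regimes the abstract contrasts:
dialogue = [("judge", "no"), ("judge", "yes, at night"), ("judge", "no")]
tool_dump = [("tool", "log: night shift; door unlocked; alarm off " * 100)]
print(min_tokens_to_resolve(dialogue, {1}))   # small: a few tokens suffice
print(min_tokens_to_resolve(tool_dump, {0}))  # large: the whole dump is needed
```

On this reading, degradation tracks how many tokens the minimal answer-bearing span occupies rather than the raw length of the conversation, consistent with the abstract's claim that dense tool responses hurt more than fragmented long-turn dialogues.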
Problem

Research questions and friction points this paper is trying to address.

long-context agents
agent-environment interaction
dynamic information synthesis
benchmarking
Lateral Thinking Puzzles
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-context agents
environment rollouts
dynamic information synthesis
Lateral Thinking Puzzles
AgentLongBench
Shicheng Fang
Fudan University
Yuxin Wang
Fudan University
Xiaoran Liu
Fudan University
Natural Language Processing
Jiahao Lu
Fudan University
Chuanyuan Tan
Soochow University
Xinchi Chen
Professor at Fudan University, Shanghai, China
Large Language Models, Embodied AI, Natural Language Processing, Information Retrieval, etc.
Yining Zheng
Fudan University
Xuanjing Huang
Fudan University
Xipeng Qiu
Fudan University