LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth

πŸ“… 2026-02-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenge of β€œcontext rot” faced by language agents operating in dynamically growing long-context environments, a setting for which effective evaluation benchmarks are currently lacking. To bridge this gap, we introduce LOCA-bench, the first agent evaluation benchmark supporting arbitrarily extensible yet semantically consistent long contexts. By leveraging automated environment state control and dynamic context generation, LOCA-bench extends context length in a controlled way while preserving task semantics. The framework jointly evaluates language models and context management strategies, covering diverse management approaches within an end-to-end evaluation pipeline. Experimental results demonstrate that advanced context management techniques substantially improve agent task success rates under extremely long contexts. We release an open-source, extensible evaluation platform to support community research on long-horizon agent capabilities in realistic scenarios.

πŸ“ Abstract
Large language models (LLMs) are increasingly capable of carrying out long-running, real-world tasks. However, as the amount of context grows, their reliability often deteriorates, a phenomenon known as "context rot". Existing long-context benchmarks primarily focus on single-step settings that evaluate a model's ability to retrieve information from a long snippet. In realistic scenarios, however, LLMs often need to act as agents that explore environments, follow instructions and plans, extract useful information, and predict correct actions under a dynamically growing context. To assess language agents in such settings, we introduce LOCA-bench (a benchmark for LOng-Context Agents). Given a task prompt, LOCA-bench leverages automated and scalable control of environment states to regulate the agent's context length. This design enables LOCA-bench to extend the context length potentially to infinity in a controlled way while keeping the underlying task semantics fixed. LOCA-bench evaluates language agents as a combination of models and scaffolds, including various context management strategies. While agent performance generally degrades as the environment states grow more complex, advanced context management techniques can substantially improve the overall success rate. We open-source LOCA-bench to provide a platform for evaluating models and scaffolds in long-context, agentic scenarios: https://github.com/hkust-nlp/LOCA-bench
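To make the core design idea concrete, here is a minimal toy sketch of controllable context growth with fixed task semantics. This is not the LOCA-bench implementation; all names (`make_environment`, `toy_agent`, the "vault code" task) are hypothetical illustrations of the principle: a tunable knob adds semantically irrelevant environment states, lengthening the agent's context arbitrarily while the underlying task and its answer stay unchanged.

```python
import random

def make_environment(num_filler_states: int, seed: int = 0):
    """Build a toy environment whose observation log grows with
    num_filler_states while the task stays fixed: recover the
    single target fact hidden among irrelevant states."""
    rng = random.Random(seed)
    target = "the vault code is 4721"
    # Filler states are semantically irrelevant, so adding more of
    # them lengthens the context without changing the task answer.
    fillers = [
        f"room {i} contains nothing of interest ({rng.randint(0, 9999)})"
        for i in range(num_filler_states)
    ]
    states = fillers[:]
    states.insert(rng.randrange(len(states) + 1), target)
    return states, "4721"

def toy_agent(states):
    """A trivial 'agent' that scans the grown context for the answer."""
    for s in states:
        if "vault code" in s:
            return s.split()[-1]
    return None

# The knob scales context length; the correct answer does not change.
for n in (10, 1000):
    states, answer = make_environment(n)
    assert toy_agent(states) == answer
```

The point of the sketch is the separation of concerns: context length is a controlled experimental variable, while task semantics (and thus the ground-truth answer) are held constant, which is what allows performance degradation to be attributed to context growth alone.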
Problem

Research questions and friction points this paper is trying to address.

context rot
language agents
long-context evaluation
dynamic context growth
agent reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

LOCA-bench
long-context agents
context rot
controllable context growth
agent scaffolding