🤖 AI Summary
Current large language model agents struggle to proactively identify and leverage unexpected yet critical information in their environments. This work introduces the concept of “environmental curiosity,” defined as an agent’s capacity to recognize, reflect upon, and act on unexpected but relevant cues. To evaluate this capability systematically, the authors embed complete task solutions in the environments of three established benchmarks (Terminal-Bench, SWE-Bench, and AppWorld) and measure whether agents use them. Agents frequently encounter these solutions (79-81% of runs on Terminal-Bench, over 90% of attempts on AppWorld) but exploit them far less often (37-50% of cases on Terminal-Bench, fewer than 7% of trials on AppWorld). Three factors influence environmental curiosity: the tools available in the agent scaffold, test-time compute, and training data distribution. Although the configurations that maximize curiosity also perform best on the unmodified benchmarks, even jointly optimized agents still ignore discovered solutions in the majority of trials, exposing a fundamental limitation of current agents.
📝 Abstract
LLM-based agents are assumed to integrate environmental observations into their reasoning: discovering highly relevant but unexpected information should naturally lead a model to exploit its own discoveries. We show that this assumption is false for current LLM-based agents, which struggle to reflect on or react to unexpected information. Across three benchmarks (Terminal-Bench, SWE-Bench, AppWorld), we inject complete task solutions into the agent environments, deliberately exposing each task's solution to the model. While agents discover these solutions on Terminal-Bench in 79-81% of runs, they interact with, or exploit, them in only 37-50% of cases. This gap is starkest in AppWorld: agents see documentation stating that a command "returns the complete solution to this task" in over 90% of attempts but exploit this in fewer than 7% of trials. We show that agents lack what we call environmental curiosity: the capability to recognize and investigate unexpected but relevant observations in response to environmental stimuli. We identify three main factors influencing environmental curiosity: available tools in the agent scaffold, test-time compute, and training data distribution. We find that the configurations that maximize curiosity also achieve the best performance on the unmodified benchmarks. Yet even jointly optimized agents still ignore discovered solutions in the majority of trials: current agents use the environment to fetch expected information, but not to revise their strategy or maximally exploit useful stimuli.
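The gap between seeing an injected solution and acting on it can be made concrete with a small calculation. The following is a minimal sketch, not the authors' evaluation code: the `Trial` record, its field names, and the toy data are hypothetical and chosen only for illustration. The abstract does not say whether exploitation is reported over all runs or only over runs in which the solution was discovered, so the sketch computes both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One agent run (hypothetical record, not the paper's log format)."""
    discovered: bool  # the injected solution appeared in the agent's observations
    exploited: bool   # the agent interacted with or used the injected solution


def curiosity_rates(trials):
    """Return discovery and exploitation rates for a list of trials."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    discovered = sum(t.discovered for t in trials)
    # Only count exploitation when the solution was actually seen.
    exploited = sum(t.discovered and t.exploited for t in trials)
    return {
        "discovery_rate": discovered / n,
        "exploitation_rate": exploited / n,
        "exploited_given_discovered": exploited / discovered if discovered else 0.0,
    }


# Toy data for illustration only; these are not results from the paper.
toy = [
    Trial(discovered=True, exploited=True),
    Trial(discovered=True, exploited=False),
    Trial(discovered=True, exploited=False),
    Trial(discovered=False, exploited=False),
]
rates = curiosity_rates(toy)
print(rates)
```

A large difference between `discovery_rate` and `exploitation_rate` is the pattern the abstract describes: the agent observes the solution but does not change its strategy.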