The World Won't Stay Still: Programmable Evolution for Agent Benchmarks

📅 2026-03-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing agent evaluation benchmarks predominantly rely on static environments, which fail to assess how well agents adapt to dynamic, real-world scenarios. To address this limitation, this work proposes ProEvolve, a framework that models environment evolution as a programmable graph transformation process. By representing data, tools, and schema in a unified typed relational graph, ProEvolve uses graph transformations to generate evolved environments and subgraph sampling to instantiate task sandboxes automatically. This enables systematic, scalable evaluation of agent adaptability. In experiments, ProEvolve evolved a single initial environment into 200 environments and 3,000 task sandboxes, which were then used to benchmark representative agents.
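To make the subgraph-sampling idea concrete, here is a minimal sketch in Python. It is not ProEvolve's implementation: the function name `sample_sandbox`, the edge labels, and the rule "pick k tools, then keep everything they depend on" are assumptions chosen only to illustrate how a task sandbox could be carved out of a larger environment graph.

```python
import random

# Hypothetical environment graph as (source, relation, target) triples.
# Node names and relation labels are illustrative, not taken from the paper.
EDGES = {
    ("get_order", "reads", "orders.id"),
    ("orders.id", "belongs_to", "orders"),
    ("apply_coupon", "reads", "orders.coupon"),
    ("orders.coupon", "belongs_to", "orders"),
}
TOOLS = {"get_order", "apply_coupon"}


def sample_sandbox(edges, tools, k, seed):
    """Sample k tools and return the subgraph they depend on.

    Returns (nodes, edges) of the induced subgraph containing the chosen
    tools plus every node reachable from them along outgoing edges.
    """
    rng = random.Random(seed)  # seeded so the same sandbox can be regenerated
    chosen = rng.sample(sorted(tools), k)

    keep = set(chosen)
    frontier = list(chosen)
    while frontier:
        node = frontier.pop()
        for src, _relation, dst in edges:
            if src == node and dst not in keep:
                keep.add(dst)
                frontier.append(dst)

    sub_edges = {(s, r, d) for (s, r, d) in edges if s in keep and d in keep}
    return keep, sub_edges


nodes_one, edges_one = sample_sandbox(EDGES, TOOLS, k=1, seed=0)
nodes_all, edges_all = sample_sandbox(EDGES, TOOLS, k=2, seed=0)
```

Seeding the sampler is a design choice for this sketch: it keeps each sandbox reproducible, which matters if many sandboxes are generated from one environment.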

📝 Abstract

LLM-powered agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet most existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents' robustness to environmental changes. In this paper, we study a crucial problem: how to evolve the agent environment in a scalable and controllable way, thereby better evaluating agents' adaptability to real-world dynamics. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment: data, tools, and schema. Under this formalism, adding, removing, or modifying capabilities is expressed as graph transformations that coherently propagate updates across tools, schemas, and data access. Building on this, ProEvolve can (1) program the evolutionary dynamics as graph transformations to generate environments automatically, and (2) instantiate task sandboxes via subgraph sampling and programming. We validate ProEvolve by evolving a single environment into 200 environments and 3,000 task sandboxes, and benchmark representative agents accordingly.
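As a rough illustration of what "removing a capability as a graph transformation" could look like, the Python sketch below builds a tiny typed relational graph and applies one transformation. Everything here is an assumption made for illustration: the `TypedGraph` class, the node types (`table`, `column`, `tool`), the relation labels, and the propagation rule that a tool is dropped when a column it touches is removed. The paper's actual formalism may differ.

```python
from dataclasses import dataclass, field


@dataclass
class TypedGraph:
    """A toy typed relational graph (hypothetical, not ProEvolve's data model)."""
    # node name -> node type: "table", "column", or "tool"
    nodes: dict[str, str] = field(default_factory=dict)
    # typed edges stored as (source, relation, target) triples
    edges: set[tuple[str, str, str]] = field(default_factory=set)

    def add_node(self, name: str, node_type: str) -> None:
        self.nodes[name] = node_type

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unknown endpoint in edge {src} -> {dst}")
        self.edges.add((src, relation, dst))


def remove_column(graph: TypedGraph, column: str) -> TypedGraph:
    """Return a new graph without `column` and without tools that depend on it.

    The propagation rule (drop every tool with an edge into the column) is an
    assumption of this sketch; the input graph is left unchanged.
    """
    dependent_tools = {
        src for (src, _relation, dst) in graph.edges
        if dst == column and graph.nodes[src] == "tool"
    }
    dropped = dependent_tools | {column}
    evolved = TypedGraph()
    evolved.nodes = {n: t for n, t in graph.nodes.items() if n not in dropped}
    evolved.edges = {
        (s, r, d) for (s, r, d) in graph.edges
        if s not in dropped and d not in dropped
    }
    return evolved


# Initial environment: one table, two columns, two tools.
env = TypedGraph()
env.add_node("orders", "table")
env.add_node("orders.id", "column")
env.add_node("orders.coupon", "column")
env.add_node("get_order", "tool")
env.add_node("apply_coupon", "tool")
env.add_edge("orders", "has_column", "orders.id")
env.add_edge("orders", "has_column", "orders.coupon")
env.add_edge("get_order", "reads", "orders.id")
env.add_edge("apply_coupon", "reads", "orders.coupon")

# One evolution step: the coupon column disappears, and so does its tool.
evolved = remove_column(env, "orders.coupon")
print(sorted(evolved.nodes))  # ['get_order', 'orders', 'orders.id']
```

Returning a new graph instead of mutating the input keeps every intermediate environment available, which fits the idea of deriving many environments from a single starting point.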

Problem

Research questions and friction points this paper is trying to address.

agent benchmark

environment evolution

real-world dynamics

adaptability evaluation

programmable evolution

Innovation

Methods, ideas, or system contributions that make the work stand out.

programmable evolution

typed relational graph

environment evolution