EvoClaw: Evaluating AI Agents on Continuous Software Evolution

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI agent evaluation benchmarks predominantly focus on isolated, one-off coding tasks, failing to capture the temporal dependencies and technical debt inherent in real-world software evolution. To address this gap, this work proposes EvoClaw, the first benchmark specifically designed for evaluating AI agents on continuous software evolution. EvoClaw leverages the DeepCommit pipeline to reconstruct semantically coherent, verifiable Milestone DAGs (directed acyclic graphs) from noisy commit logs, integrating semantic goal clustering with long-term evolutionary trajectory assessment. Evaluations across four agent frameworks and twelve state-of-the-art models reveal a stark performance drop, from over 80% on isolated tasks to at most 38% in continuous evolution scenarios, highlighting critical deficiencies in current AI agents' capabilities for sustained system maintenance.

📝 Abstract
With AI agents increasingly deployed as long-running systems, it becomes essential to autonomously construct and continuously evolve customized software to enable interaction within dynamic environments. Yet, existing benchmarks evaluate agents on isolated, one-off coding tasks, neglecting the temporal dependencies and technical debt inherent in real-world software evolution. To bridge this gap, we introduce DeepCommit, an agentic pipeline that reconstructs verifiable Milestone DAGs from noisy commit logs, where milestones are defined as semantically cohesive development goals. These executable sequences enable EvoClaw, a novel benchmark that requires agents to sustain system integrity and limit error accumulation, dimensions of long-term software evolution largely missing from current benchmarks. Our evaluation of 12 frontier models across 4 agent frameworks reveals a critical vulnerability: overall performance scores drop significantly from >80% on isolated tasks to at most 38% in continuous settings, exposing agents' profound struggle with long-term maintenance and error propagation.
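The paper does not publish DeepCommit's algorithm here, but the core idea — collapsing a noisy commit graph into a DAG over semantically cohesive milestones — can be sketched as follows. In this illustrative toy (all names hypothetical), `milestone_of` stands in for the paper's semantic goal clustering, and a milestone edge is emitted whenever a commit in one milestone has a parent commit in another:

```python
def build_milestone_dag(commits, milestone_of):
    """Collapse a commit graph into milestone-level DAG edges.

    commits: list of (commit_id, parent_ids) pairs from the commit log.
    milestone_of: dict commit_id -> milestone label (a stand-in for
        the semantic goal clustering described in the paper).
    Returns the set of (parent_milestone, child_milestone) edges.
    """
    edges = set()
    for cid, parents in commits:
        for pid in parents:
            a, b = milestone_of[pid], milestone_of[cid]
            if a != b:  # dependency crosses a milestone boundary
                edges.add((a, b))
    return edges

# Toy history: c1 and c2 build "auth"; c3 and c4 then build "api" on top.
commits = [("c1", []), ("c2", ["c1"]), ("c3", ["c2"]), ("c4", ["c3"])]
milestone_of = {"c1": "auth", "c2": "auth", "c3": "api", "c4": "api"}
print(build_milestone_dag(commits, milestone_of))  # {('auth', 'api')}
```

A benchmark harness could then replay milestones in a topological order of these edges, checking after each step that the system still builds and passes its tests — the "sustained integrity" dimension EvoClaw measures.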
Problem

Research questions and friction points this paper is trying to address.

continuous software evolution
AI agents
technical debt
temporal dependencies
long-term maintenance
Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous software evolution
Milestone DAG
error accumulation
agent benchmarking
technical debt