🤖 AI Summary
This paper addresses the agent out-of-sync problem—where LLM-based agents in collaborative software engineering (CSE) fail to integrate their work because their understanding diverges from the evolving shared environment. The authors formally characterize this phenomenon through SyncMind, a framework that systematically defines the out-of-sync challenge, and introduce SyncBench, a benchmark of 24,332 out-of-sync instances grounded in 21 popular GitHub repositories and validated via executable unit tests. SyncBench enables measurable, verifiable evaluation of out-of-sync recovery, incorporating dimensions such as collaboration willingness and resource awareness. Experiments across models—including Llama-3.1 and Claude-3.5-Sonnet—on real-world open-source projects reveal low recovery rates (3.33%–28.18%), consistently low collaboration willingness (≤ 4.86%), and pronounced deficits in resource awareness. These findings expose fundamental limitations of current LLM agents in collaborative SE and establish SyncBench as an empirical basis for designing more robust, adaptive agent systems.
📝 Abstract
Software engineering (SE) is increasingly collaborative, with developers working together on shared, complex codebases. Effective collaboration in shared environments requires participants -- whether humans or AI agents -- to stay on the same page as their environment evolves. When a collaborator's understanding diverges from the current state -- what we term the out-of-sync challenge -- the collaborator's actions may fail, leading to integration issues. In this work, we introduce SyncMind, a framework that systematically defines the out-of-sync problem faced by large language model (LLM) agents in collaborative software engineering (CSE). Based on SyncMind, we create SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE, derived from 21 popular GitHub repositories with executable verification tests. Experiments on SyncBench uncover critical insights into existing LLM agents' capabilities and limitations. Besides substantial performance gaps among agents (from the Llama-3.1 agent's 3.33% to Claude-3.5-Sonnet's 28.18%), their consistently low collaboration willingness (≤ 4.86%) suggests fundamental limitations of existing LLMs in CSE. However, when collaboration occurs, it correlates positively with out-of-sync recovery success. Minimal performance differences in agents' resource-aware out-of-sync recoveries further reveal a significant lack of resource awareness and adaptability, shedding light on future resource-efficient collaborative systems. Code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/.