An Empirical Study of Proactive Coding Assistants in Real-World Software Development

📅 2026-05-07

🤖 AI Summary

Existing research on proactive programming assistants relies largely on IDE interaction data simulated by large language models, which may fail to capture authentic developer behavior and can bias evaluation. To address this, the authors built a VS Code extension to collect real-world interaction logs from 1,246 industry developers over three consecutive days and constructed paired model-simulated trajectories. Their analysis finds that real and simulated behavior differ substantially in behavioral diversity, temporal structure, and exploration patterns. Building on the collected data, they introduce ProCodeBench, a real-world benchmark for proactive intent prediction. Experiments show that current LLMs, retrieval-augmented methods, and agentic baselines remain far from reliable on real traces, which suggests that simulation-based evaluation can overestimate real-world performance. A training study further indicates that simulated data cannot replace real data but can complement it when used before fine-tuning on real data.

📝 Abstract

Large language model (LLM)-based coding assistants have made substantial progress, yet most systems remain reactive, requiring developers to explicitly formulate their needs. Proactive coding assistants aim to infer latent developer intent from integrated development environment (IDE) interactions and repository context, thereby reducing interaction overhead and supporting more seamless assistance. However, research in this direction is limited by the scarcity of large-scale real-world developer behavior data. Existing studies therefore often rely on LLM-simulated IDE traces, whose fidelity to real development behavior remains unclear. In this paper, we investigate this simulation-to-reality gap through a large-scale empirical study. We collect real IDE interaction traces from 1,246 experienced industry developers over three consecutive days using a custom Visual Studio Code extension, and construct paired LLM-simulated traces for controlled comparison. Our analysis shows that simulated traces differ substantially from real traces in behavioral diversity, temporal structure, and exploratory patterns. Based on the collected data, we introduce ProCodeBench, a real-world benchmark for proactive intent prediction. Experiments with representative LLMs, retrieval-augmented methods, and agentic baselines show that current approaches remain far from reliable under real IDE traces, suggesting that simulation-based evaluation can overestimate real-world performance. Finally, our training study shows that simulated data cannot replace real data, but can complement it when used before real-world fine-tuning. These findings highlight the importance of real developer behavior data for evaluating and training proactive coding assistants.
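
To make the notion of "behavioral diversity" concrete, the sketch below compares two IDE traces by the Shannon entropy of their action-type distributions. This is only an illustration under stated assumptions: the abstract does not say which metrics the authors use, and the action names, the two example traces, and the `action_entropy` helper are all hypothetical.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Shannon entropy (in bits) of the action-type distribution of one trace.

    Assumption: a trace is a plain list of action-type strings. The paper's
    real trace schema and diversity metric are not given in the abstract.
    """
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical traces, invented purely for illustration (not data from the paper).
real_trace = ["edit", "scroll", "open_file", "edit", "run_tests", "search", "undo", "edit"]
simulated_trace = ["edit", "edit", "edit", "save", "edit", "edit", "save", "edit"]

print(f"real:      {action_entropy(real_trace):.3f} bits")
print(f"simulated: {action_entropy(simulated_trace):.3f} bits")
```

A higher entropy means the trace spreads its actions over more types and more evenly. In this toy example the hand-written "real" trace scores higher than the repetitive "simulated" one, which mirrors the kind of gap the abstract describes without reproducing any of its measurements.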

Problem

Research questions and friction points this paper is trying to address.

proactive coding assistants

real-world developer behavior

IDE interaction traces

simulation-to-reality gap

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

proactive coding assistants

real-world developer behavior

simulation-to-reality gap