Copilot Arena: A Platform for Code LLM Evaluation in the Wild

📅 2025-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limitation of conventional benchmarks in accurately reflecting large language models' (LLMs) real-world coding capabilities. To this end, the authors introduce the first code generation evaluation platform embedded natively within an IDE. Methodologically, they propose a lightweight dual-model comparison interface, a low-latency sampling strategy, and a context-aware prompting scheme that together enable real-time, developer-provided pairwise preference feedback on code suggestions. The contributions are threefold: (1) extensive in-IDE deployment empirically demonstrates that model performance rankings under realistic development conditions substantially diverge from those reported on standard benchmarks (e.g., HumanEval); (2) the study uncovers novel patterns in human preference, including consistency across programming languages but significant variation across task categories; and (3) the platform has served over 4.5 million code suggestions and collected more than 11,000 human pairwise judgments. Both the platform and the annotated dataset are publicly released.

📝 Abstract
Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no clear solution. We introduce Copilot Arena, a platform to collect user preferences for code generation through native integration into a developer's working environment. Copilot Arena comprises a novel interface for comparing pairs of model outputs, a sampling strategy optimized to reduce latency, and a prompting scheme to enable code completion functionality. Copilot Arena has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements. Our results highlight the importance of model evaluations in integrated settings. We find that model rankings from Copilot Arena differ from those of existing evaluations, which we attribute to the more realistic distribution of data and tasks contained in Copilot Arena. We also identify novel insights into human preferences on code such as an observed consistency in user preference across programming languages yet significant variation in preference due to task category. We open-source Copilot Arena and release data to enable human-centric evaluations and improve understanding of coding assistants.
Problem

Research questions and friction points this paper is trying to address.

Evaluate real-world coding capabilities
Develop platform for code LLM evaluation
Analyze human preferences for code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

User preference collection platform
Latency-optimized sampling strategy
Code completion prompting scheme
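The latency-optimized sampling strategy listed above can be illustrated with a minimal sketch. The snippet below shows one plausible way to sample model pairs while favoring low combined latency, so users wait less on average yet every model still receives comparisons. All model names, latency figures, and the inverse-latency weighting are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical median response latencies (seconds) per model.
MODEL_LATENCY = {
    "model-a": 0.4,
    "model-b": 0.7,
    "model-c": 1.2,
    "model-d": 2.0,
}

def sample_pair(latencies, rng=random):
    """Sample an unordered pair of distinct models for comparison.

    Each pair is weighted by the inverse of its combined latency,
    so fast pairs are shown more often, but slower models are
    never excluded entirely.
    """
    names = list(latencies)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    weights = [1.0 / (latencies[a] + latencies[b]) for a, b in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]

pair = sample_pair(MODEL_LATENCY)
```

In a deployment like the one the abstract describes, both models in the sampled pair would generate a completion, and the developer's choice between them would be logged as one pairwise judgment.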