🤖 AI Summary
A reproducible software engineering benchmark tailored for C# is currently lacking, hindering rigorous evaluation and advancement of AI coding agents for enterprise-grade languages. Method: We introduce SWE-Sharp-Bench, the first C#-specific benchmark, comprising 150 real-world bug-fixing tasks drawn from 17 open-source repositories and rigorously aligned with the SWE-Bench protocol to ensure automated evaluation and full reproducibility. We publicly release the entire dataset construction pipeline and conduct cross-language evaluation under identical model-agent configurations. Contribution/Results: Experiments show that state-of-the-art models resolve only 40% of the C# tasks, far below the 70% resolve rate on Python, quantifying for the first time a critical performance gap in C# code intelligence. This work bridges a key gap in multilingual code intelligence evaluation and establishes standardized infrastructure for research on C# code generation, repair, and agents.
📝 Abstract
AI coding agents have made great progress on Python software engineering benchmarks like SWE-Bench, and on other languages such as Java and C through benchmarks like Multi-SWE-Bench. However, C# -- a prominent enterprise language ranking #5 in the TIOBE index -- remains absent from such benchmarks. We introduce SWE-Sharp-Bench, a reproducible software engineering benchmark for C# featuring 150 instances from 17 repositories. Evaluating identical model-agent configurations across languages reveals a significant performance gap: while 70% of Python tasks in SWE-Bench Verified are solved, only 40% of our C# tasks are resolved. We open-source SWE-Sharp-Bench and our entire curation pipeline.
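Under the SWE-Bench protocol referenced above, an instance counts as resolved only when the candidate patch makes the previously failing ("fail-to-pass") tests pass without regressing the existing ("pass-to-pass") tests. A minimal sketch of that scoring logic, with illustrative names (`Instance`, `is_resolved`, `resolve_rate`) that are not the benchmark's actual API:

```python
# Hedged sketch of SWE-Bench-style resolve scoring, not the actual harness.
from dataclasses import dataclass

@dataclass
class Instance:
    instance_id: str
    fail_to_pass: list   # tests that must flip from failing to passing
    pass_to_pass: list   # tests that must keep passing (no regressions)

def is_resolved(instance: Instance, test_results: dict) -> bool:
    """test_results maps test name -> True (pass) / False (fail),
    measured after the agent's patch is applied."""
    f2p_ok = all(test_results.get(t, False) for t in instance.fail_to_pass)
    p2p_ok = all(test_results.get(t, False) for t in instance.pass_to_pass)
    return f2p_ok and p2p_ok

def resolve_rate(instances: list, results_by_id: dict) -> float:
    """Fraction of instances resolved, the headline metric reported above."""
    resolved = sum(
        is_resolved(inst, results_by_id.get(inst.instance_id, {}))
        for inst in instances
    )
    return resolved / len(instances)
```

The headline numbers (40% C# vs. 70% Python) are resolve rates in this sense: a strict all-or-nothing per-instance criterion, which is what makes the evaluation fully automated and reproducible.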