🤖 AI Summary
A reproducible software engineering benchmark tailored for C# is currently lacking, hindering rigorous evaluation and advancement of AI coding agents for enterprise-grade languages. Method: We introduce SWE-Sharp-Bench, the first C#-specific benchmark, comprising 150 real-world bug-fixing tasks drawn from 17 open-source repositories and rigorously aligned with the SWE-Bench protocol to ensure automated evaluation and full reproducibility. We publicly release the entire dataset construction pipeline and conduct cross-language evaluation under identical model-agent configurations. Contribution/Results: Experiments show that state-of-the-art models resolve only 40% of the C# tasks, far below the 70% resolve rate on Python, quantifying for the first time a critical performance gap in C# code intelligence. This work bridges a key gap in multilingual code intelligence evaluation and establishes standardized infrastructure for research on C# code generation, repair, and agents.
📝 Abstract
AI coding agents have made great progress on Python software engineering benchmarks like SWE-Bench, and on other languages such as Java and C through benchmarks like Multi-SWE-Bench. However, C# -- a prominent enterprise language ranking #5 in the TIOBE index -- remains absent from such benchmarks. We introduce SWE-Sharp-Bench, a reproducible software engineering benchmark for C# featuring 150 instances from 17 repositories. Evaluating identical model-agent configurations across languages reveals a significant performance gap: while 70% of Python tasks in SWE-Bench Verified are solved, only 40% of our C# tasks are resolved. We open-source SWE-Sharp-Bench and our entire curation pipeline.
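Under the SWE-Bench protocol referenced above, an instance counts as resolved only when the candidate patch makes the previously failing ("fail-to-pass") tests pass without regressing the existing ("pass-to-pass") tests. A minimal sketch of that scoring logic, with illustrative names (`Instance`, `is_resolved`, `resolve_rate`) that are not the benchmark's actual API:

```python
# Hedged sketch of SWE-Bench-style resolve scoring, not the actual harness.
from dataclasses import dataclass

@dataclass
class Instance:
    instance_id: str
    fail_to_pass: list   # tests that must flip from failing to passing
    pass_to_pass: list   # tests that must keep passing (no regressions)

def is_resolved(instance: Instance, test_results: dict) -> bool:
    """test_results maps test name -> True (pass) / False (fail),
    measured after the agent's patch is applied."""
    f2p_ok = all(test_results.get(t, False) for t in instance.fail_to_pass)
    p2p_ok = all(test_results.get(t, False) for t in instance.pass_to_pass)
    return f2p_ok and p2p_ok

def resolve_rate(instances: list, results_by_id: dict) -> float:
    """Fraction of instances resolved, the headline metric reported above."""
    resolved = sum(
        is_resolved(inst, results_by_id.get(inst.instance_id, {}))
        for inst in instances
    )
    return resolved / len(instances)
```

The headline numbers (40% C# vs. 70% Python) are resolve rates in this sense: a strict all-or-nothing per-instance criterion, which is what makes the evaluation fully automated and reproducible.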