AI Summary
Existing code repair benchmarks (e.g., SWE-bench) are heavily Python-centric, hindering rigorous evaluation of how well large language models (LLMs) generalize across diverse programming languages.
Method: We introduce Multi-SWE-bench, the first high-quality multilingual code repair benchmark covering Java, TypeScript, JavaScript, Go, Rust, C, and C++, comprising 1,632 expert-annotated instances. It enables unified evaluation across both systems and scripting languages. We open-source the full data production pipeline and 4,723 structured reinforcement learning training samples, and evaluate models with three representative agent-based methods: Agentless, SWE-agent, and OpenHands.
Contribution/Results: Experiments reveal substantial performance degradation of current LLMs on non-Python languages, identifying key bottlenecks in multilingual generalization. Multi-SWE-bench establishes a foundational resource for evaluating and training multilingual LLMs, advancing robust, language-agnostic code intelligence research.
Abstract
The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.
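To make the issue-resolving setup concrete, the sketch below shows what one benchmark instance and its resolution check might look like. The field names and the `is_resolved` criterion are illustrative assumptions modeled on SWE-bench-style benchmarks, not the actual Multi-SWE-bench format: a patch counts as resolving an issue if the designated failing tests now pass and no previously passing test breaks.

```python
# Hypothetical schema for a single issue-resolving instance.
# All field names here are assumptions for illustration only.
instance = {
    "instance_id": "example-org__example-repo-123",   # hypothetical ID
    "language": "rust",             # one of the seven covered languages
    "repo": "example-org/example-repo",
    "base_commit": "abc123",        # commit the model's patch is applied to
    "issue_text": "Fix panic when parsing empty input",
    "gold_patch": "diff --git a/src/parse.rs ...",    # reference fix
    "fail_to_pass": ["tests::parse_empty"],  # tests the patch must fix
}

def is_resolved(failed_before: set, failed_after: set,
                fail_to_pass: list) -> bool:
    """An instance counts as resolved if every targeted failing test
    now passes and no previously passing test has started failing."""
    newly_fixed = all(t not in failed_after for t in fail_to_pass)
    no_regressions = failed_after <= failed_before - set(fail_to_pass)
    return newly_fixed and no_regressions

# The patch fixed the target test and broke nothing else:
print(is_resolved({"tests::parse_empty"}, set(), instance["fail_to_pass"]))
```

Under this simplified criterion, the final call prints `True`; a patch that left the target test failing, or that introduced a new failure, would be rejected.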