AI Summary
Existing code repair benchmarks (e.g., SWE-bench) are heavily Python-centric, hindering rigorous evaluation of how well large language models (LLMs) generalize across diverse programming languages.
Method: We introduce Multi-SWE-bench, the first high-quality multilingual code repair benchmark covering Java, TypeScript, JavaScript, Go, Rust, C, and C++, comprising 1,632 expert-annotated instances. It enables unified evaluation across both systems and scripting languages. We open-source the full data production pipeline and 4,723 structured reinforcement learning training samples, and evaluate models with three representative agent-based methods: Agentless, SWE-agent, and OpenHands.
Contribution/Results: Experiments reveal substantial performance degradation of current LLMs on non-Python languages, identifying key bottlenecks in multilingual generalization. Multi-SWE-bench establishes a foundational resource for evaluating and training multilingual LLMs, advancing robust, language-agnostic code intelligence research.
Abstract
The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.
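To make the issue-resolving setup concrete, the sketch below shows what one benchmark instance and its resolution check might look like. The field names and the `is_resolved` criterion are illustrative assumptions modeled on SWE-bench-style benchmarks, not the actual Multi-SWE-bench format: a patch counts as resolving an issue if the designated failing tests now pass and no previously passing test breaks.

```python
# Hypothetical schema for a single issue-resolving instance.
# All field names here are assumptions for illustration only.
instance = {
    "instance_id": "example-org__example-repo-123",   # hypothetical ID
    "language": "rust",             # one of the seven covered languages
    "repo": "example-org/example-repo",
    "base_commit": "abc123",        # commit the model's patch is applied to
    "issue_text": "Fix panic when parsing empty input",
    "gold_patch": "diff --git a/src/parse.rs ...",    # reference fix
    "fail_to_pass": ["tests::parse_empty"],  # tests the patch must fix
}

def is_resolved(failed_before: set, failed_after: set,
                fail_to_pass: list) -> bool:
    """An instance counts as resolved if every targeted failing test
    now passes and no previously passing test has started failing."""
    newly_fixed = all(t not in failed_after for t in fail_to_pass)
    no_regressions = failed_after <= failed_before - set(fail_to_pass)
    return newly_fixed and no_regressions

# The patch fixed the target test and broke nothing else:
print(is_resolved({"tests::parse_empty"}, set(), instance["fail_to_pass"]))
```

Under this simplified criterion, the final call prints `True`; a patch that left the target test failing, or that introduced a new failure, would be rejected.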