Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

πŸ“… 2025-04-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing issue-resolving benchmarks (e.g., SWE-bench) are heavily Python-centric, hindering rigorous evaluation of how well large language models (LLMs) generalize across programming languages. Method: We introduce Multi-SWE-bench, a high-quality multilingual issue-resolving benchmark covering Java, TypeScript, JavaScript, Go, Rust, C, and C++, comprising 1,632 expert-annotated instances and enabling unified evaluation across both systems and scripting languages. We open-source the full data production pipeline, release 4,723 structured reinforcement learning (RL) training instances through the Multi-SWE-RL community, and evaluate state-of-the-art models with three representative methods: Agentless, SWE-agent, and OpenHands. Contribution/Results: Experiments reveal substantial performance degradation of current LLMs on non-Python languages, identifying key bottlenecks in multilingual generalization. Multi-SWE-bench establishes a foundational resource for evaluating and training multilingual LLMs, advancing robust, language-agnostic code intelligence research.

πŸ“ Abstract
The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.
Problem

Research questions and friction points this paper is trying to address.

Lack of multilingual benchmarks for evaluating LLMs on issue resolving (existing ones, such as SWE-bench, are almost exclusively Python)
Need for coverage of diverse programming languages in codebase-modification tasks
Absence of large-scale RL training datasets for multilingual issue resolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual issue-resolving benchmark covering seven programming languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++)
1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators
Open-source data production pipeline with detailed tutorials, plus 4,723 RL training instances via the Multi-SWE-RL community
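To make the evaluation setup concrete, the sketch below shows what an issue-resolving benchmark instance conceptually contains and how a candidate patch is judged. All field and function names here are illustrative assumptions, not the paper's actual schema: an instance pairs a repository snapshot with an issue description and a set of tests that must flip from failing to passing once a correct patch is applied.

```python
# Hypothetical sketch of an issue-resolving instance and its pass criterion.
# Field names (repo, language, issue_text, fail_to_pass) are illustrative,
# not Multi-SWE-bench's real schema.
from dataclasses import dataclass, field


@dataclass
class IssueInstance:
    repo: str                      # repository the issue was filed against
    language: str                  # one of the seven covered languages
    issue_text: str                # natural-language issue description
    fail_to_pass: list = field(default_factory=list)  # tests a fix must make pass


def is_resolved(instance: IssueInstance, passing_tests: set) -> bool:
    """A patch resolves the instance iff every fail-to-pass test now passes."""
    return all(t in passing_tests for t in instance.fail_to_pass)


inst = IssueInstance(
    repo="example-org/example-project",
    language="Go",
    issue_text="nil pointer dereference when the config file is missing",
    fail_to_pass=["TestLoadConfigMissingFile"],
)

print(is_resolved(inst, {"TestLoadConfigMissingFile"}))  # True: the fix works
print(is_resolved(inst, set()))                          # False: issue unresolved
```

This mirrors the execution-based evaluation style used by SWE-bench-like benchmarks: correctness is decided by re-running the repository's tests after applying the model's patch, not by comparing the patch text to a reference.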
Authors
Daoguang Zan (ByteDance Seed; Large Language Model, Software Engineering, Coding Agent)
Zhirong Huang (SLAC and Stanford University; Accelerator Physics, Free Electron Lasers)
Wei Liu
Hanwu Chen
Linhao Zhang
Shulin Xin
Lu Chen
Qi Liu
Xiaojian Zhong
Aoyan Li
Siyao Liu
Yongsheng Xiao
Liangqiang Chen
Yuyu Zhang (Research Scientist, ByteDance; Machine Learning)
Jing Su
Tianyu Liu
Rui Long
Kai Shen (Associate Professor of Computer Science, University of Rochester; Computer Systems)
Liang Xiang