RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation

📅 2024-12-23
🏛️ arXiv.org
📈 Citations: 21
Influential: 1
🤖 AI Summary
Existing code translation benchmarks are limited to snippets, functions, or files, and fail to reflect real-world repository-scale migration requirements. This paper introduces RepoTransBench, the first repository-level code translation benchmark supporting cross-language, end-to-end executable testing and functional correctness verification. Methodologically, it pairs each repository with an automatically generated test suite and systematically evaluates 11 state-of-the-art large language models (LLMs), revealing a maximum Success@1 of only 7.33%. It identifies fundamental model deficiencies in handling cross-file dependencies, state consistency, and build logic. To address these, the paper proposes an error-feedback-driven multi-round iterative debugging mechanism, which raises the best model's Success@1 to 21%. The work establishes a new paradigm for long-horizon, structured code translation research and provides an empirical foundation for advancing repository-level code migration.
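The error-feedback-driven iterative debugging mechanism can be sketched as a translate-test-repair loop. This is a minimal illustrative sketch, not the paper's implementation: `StubTranslator`, `run_test_suite`, and `translate_with_debugging` are hypothetical names, and the stub "fixes one bug per round" purely to demonstrate the loop mechanics.

```python
from dataclasses import dataclass, field

@dataclass
class TestResult:
    passed: bool
    errors: list = field(default_factory=list)

class StubTranslator:
    """Toy stand-in for an LLM translator: the first draft contains
    bugs, and each feedback round repairs one of them. Illustrative only."""
    def translate_repo(self, source_repo):
        return {"source": source_repo, "bugs": 2}  # one-shot draft, 2 bugs

    def repair_with_feedback(self, repo, errors):
        repo = dict(repo)
        repo["bugs"] = max(0, repo["bugs"] - 1)    # each round fixes one bug
        return repo

def run_test_suite(repo):
    """The translated repo 'passes' once no bugs remain."""
    if repo["bugs"] == 0:
        return TestResult(passed=True)
    return TestResult(passed=False, errors=[f"{repo['bugs']} failing tests"])

def translate_with_debugging(source_repo, llm, max_rounds=3):
    """Translate once, then iteratively repair using test-failure
    feedback until the suite passes or the round budget runs out."""
    translated = llm.translate_repo(source_repo)   # one-shot attempt (Success@1)
    for _ in range(max_rounds):
        result = run_test_suite(translated)
        if result.passed:
            return translated, True
        # Feed the error messages back to the model for another round.
        translated = llm.repair_with_feedback(translated, result.errors)
    return translated, run_test_suite(translated).passed

repo, ok = translate_with_debugging("src/", StubTranslator())
print(ok)  # → True (repaired within the allotted debug rounds)
```

The one-shot attempt before the loop corresponds to Success@1; allowing feedback rounds is what lifts the reported score from 7.33% to 21%.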

📝 Abstract
Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate the performance of such code translators. However, previous benchmarks mostly provide fine-grained samples, focusing on snippet-, function-, or file-level code translation. Such benchmarks do not accurately reflect real-world demands, where entire repositories often need to be translated, involving longer code length and more complex functionalities. To address this gap, we propose a new benchmark, named RepoTransBench, which is a real-world repository-level code translation benchmark with an automatically executable test suite. We conduct experiments on RepoTransBench to evaluate the translation performance of 11 advanced LLMs. We find that the Success@1 score (test success in one attempt) of the best-performing LLM is only 7.33%. To further explore the potential of LLMs for repository-level code translation, we provide LLMs with error-related feedback to perform iterative debugging and observe an average 7.09% improvement on Success@1. However, even with this improvement, the Success@1 score of the best-performing LLM is only 21%, which may not meet the need for reliable automatic repository-level code translation. Finally, we conduct a detailed error analysis and highlight current LLMs' deficiencies in repository-level code translation, which could provide a reference for further improvements.
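The Success@1 metric from the abstract is simply the fraction of repositories whose one-shot translation passes its entire test suite. A minimal sketch, with an illustrative function name and illustrative counts (the repository totals below are not the benchmark's actual size):

```python
def success_at_1(first_attempt_passed: list[bool]) -> float:
    """Return the Success@1 score in percent.

    first_attempt_passed[i] is True iff repository i's one-shot
    translation passed every test in its suite.
    """
    if not first_attempt_passed:
        return 0.0
    return 100.0 * sum(first_attempt_passed) / len(first_attempt_passed)

# E.g. with 11 passing repositories out of 150 (illustrative numbers):
print(round(success_at_1([True] * 11 + [False] * 139), 2))  # → 7.33
```

The same computation applies after iterative debugging, with "first attempt" replaced by "best attempt within the debugging budget."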
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks cover only snippet-, function-, or file-level translation, missing repository-scale demands
Real-world migration involves longer code and more complex cross-file functionality than prior benchmarks capture
It is unclear how well current LLMs perform on repository-level translation and where they fail
Innovation

Methods, ideas, or system contributions that make the work stand out.

RepoTransBench, a real-world repository-level translation benchmark with an automatically executable test suite
Evaluation of 11 advanced LLMs, plus error-feedback-driven iterative debugging yielding an average 7.09% Success@1 gain
Detailed error analysis highlighting LLM deficiencies in repository-level code translation