A New Benchmark for Evaluating Code Translation with Third-Party Libraries

📅 2025-09-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code translation benchmarks inadequately support third-party libraries (TPLs), obscuring LLM failures in library-dependent scenarios. To address this, we introduce TransLibEval—the first TPL-oriented code translation benchmark—comprising 200 real-world tasks across Python, Java, and C++, spanning data processing, machine learning, and web development. We design the first evaluation framework dedicated to TPL-aware translation, integrating high-coverage unit testing with fine-grained dependency analysis. Our evaluation reveals that mainstream LLMs suffer over a 60% average accuracy drop in library-aware translation and uncovers previously overlooked third-party reference errors. Through controlled experiments comparing six translation strategies—including IR-guided and retrieval-augmented approaches—we demonstrate significant performance disparities among them. This work establishes an empirical foundation and actionable optimization pathways for advancing library-aware code intelligence.

📝 Abstract
In recent years, Large Language Models (LLMs) have been widely studied for code translation at the method, class, and even repository level. However, existing benchmarks are limited in the categories and scale of Third-Party Libraries (TPLs) they cover, making TPL-related errors hard to expose and hindering the development of targeted solutions. Given the high dependence (over 90%) on TPLs in practical programming, demystifying and analyzing LLMs' code translation performance involving various TPLs is imperative. To address this gap, we construct TransLibEval, the first benchmark dedicated to library-centric code translation. It consists of 200 real-world tasks across Python, Java, and C++, each explicitly involving TPLs from diverse categories such as data processing, machine learning, and web development, with comprehensive dependency coverage and high-coverage test suites. We evaluate seven recent LLMs from commercial, general, and code-specialized families under six translation strategies of three categories: Direct, IR-guided, and Retrieval-augmented. Experimental results show a dramatic performance drop compared with library-free settings (an average Computational Accuracy (CA) decline of over 60%), while the strategies exhibit heterogeneous advantages. Furthermore, we analyze 4,831 failed cases from GPT-4o, one of the state-of-the-art (SOTA) LLMs, revealing numerous third-party reference errors that were previously obscured. These findings highlight the unique challenges of library-centric translation and provide practical guidance for improving TPL-aware code intelligence.
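To make the notion of a library-centric translation task concrete, here is a minimal sketch of the kind of problem the abstract describes. The task, data, and function name are hypothetical illustrations, not drawn from the TransLibEval benchmark: a Python source that would naturally be written as a one-line pandas call (e.g. `df.groupby('k')['v'].mean()`) must be translated to Java or C++, where no one-to-one TPL counterpart exists, so a correct translation has to reproduce the library's semantics rather than its syntax. The library-free reference below shows those semantics explicitly.

```python
# Hypothetical TransLibEval-style task (illustrative, not from the benchmark):
# the Python original uses a TPL call such as df.groupby('k')['v'].mean();
# translating it requires mapping the library's grouped-aggregation semantics,
# which this library-free reference spells out.
from collections import defaultdict


def mean_by_key(rows):
    """Group rows by their 'k' field and return the mean of 'v' per group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row["k"]] += row["v"]
        counts[row["k"]] += 1
    return {key: sums[key] / counts[key] for key in sums}


rows = [
    {"k": "a", "v": 1.0},
    {"k": "a", "v": 3.0},
    {"k": "b", "v": 2.0},
]
print(mean_by_key(rows))  # {'a': 2.0, 'b': 2.0}
```

A translator that only transliterates the pandas one-liner token by token will produce a third-party reference error in the target language; the benchmark's dependency analysis is designed to surface exactly this failure mode.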
Problem

Research questions and friction points this paper is trying to address.

Evaluating code translation performance with third-party libraries
Addressing limited benchmark coverage for library dependencies
Analyzing LLM errors in real-world TPL-centric tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed TransLibEval benchmark for library translation
Evaluated LLMs across multiple strategies and languages
Analyzed TPL reference errors in translation failures
👥 Authors
Pengyu Xue
Shandong University, China
Kunwu Zheng
Shandong University, China
Zhen Yang
Shandong University, China
Yifei Pei
Shandong University, China
Linhao Wu
Shandong University, China
Jiahui Dong
Shandong University, China
Xiapu Luo
The Hong Kong Polytechnic University
Yan Xiao
Sun Yat-sen University, China
Fei Liu
Shandong University, China
Yuxuan Zhang
Shandong University, China
Xiran Lyu
Shandong University, China
Xianhang Li
University of California, Santa Cruz
Xuanyu Zhu
Shandong University, China
Chengyi Wang
ByteDance Inc.