🤖 AI Summary
This work addresses the critical yet overlooked issue of execution efficiency in code translation by large language models (LLMs), observing that functional correctness does not guarantee efficient execution. To this end, the authors introduce TRACE, the first benchmark specifically designed to evaluate the efficiency of LLM-generated code translations, comprising 1,000 efficiency-sensitive tasks across C++, Java, and Python. A stress-testing framework is developed to systematically assess 28 prominent models. The study reveals that 23.5% of functionally correct translations exhibit pronounced inefficiencies, with 66.4% of these stemming from mismatches in language constructs. Notably, some smaller open-source models outperform leading closed-source counterparts in efficiency, while prompt-engineering strategies yield only limited gains. TRACE thus establishes a new standard for efficiency-aware evaluation of code translation.
📝 Abstract
While Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of *execution efficiency* remains overlooked. We present **TRACE**, the first benchmark to explicitly assess efficiency in LLM-translated code. TRACE includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency degradations often overlooked by small-scale tests. Using TRACE, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness is not a reliable proxy for efficiency: the correctness leader *Claude-4-think* achieves only mid-level time efficiency, outperformed by smaller open-source LLMs such as *Qwen2.5-Coder-14B-Instruct*. 2) Inefficiency is both prevalent and patterned: 23.5% of correct translations exhibit pronounced inefficiency, distributed across algorithmic faults (11.9%), language construct mismatches (66.4%), and resource mismanagement (21.7%). 3) Inference-time prompt strategies bring only modest improvements, suggesting that current LLMs lack intrinsic efficiency awareness. Together, our results establish efficiency as an essential dimension of code translation and position TRACE as a principled foundation for efficiency-oriented evaluation.
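To make the dominant failure mode concrete, here is a minimal sketch (not taken from the paper; the function names and data sizes are illustrative) of a "language construct mismatch": a translation that renders a C++ `std::set` lookup as membership testing against a Python list. Both versions are functionally correct and pass small-scale tests, but the list variant does O(n) work per query, which a stress test with larger inputs exposes.

```python
import time

def count_hits_list(items, queries):
    # Literal translation: `q in lookup` on a list scans linearly,
    # so the whole loop is O(len(items) * len(queries)).
    lookup = list(items)
    return sum(1 for q in queries if q in lookup)

def count_hits_set(items, queries):
    # Idiomatic translation: a set gives average O(1) membership,
    # so the loop is O(len(items) + len(queries)).
    lookup = set(items)
    return sum(1 for q in queries if q in lookup)

# Small-scale test: both agree, and the timing gap is negligible.
assert count_hits_list(range(100), range(0, 200, 2)) == \
       count_hits_set(range(100), range(0, 200, 2))

# Stress test: same inputs, same answers, very different runtimes.
items, queries = range(2_000), range(5_000)
for fn in (count_hits_list, count_hits_set):
    t0 = time.perf_counter()
    result = fn(items, queries)
    print(f"{fn.__name__}: {result} hits in {time.perf_counter() - t0:.3f}s")
```

This is the pattern the benchmark's stress tests are designed to surface: a translation that looks fine under a few unit tests but degrades asymptotically at scale.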