🤖 AI Summary
Current evaluations of code translation often attribute failures to incorrect model output when the errors actually stem from improper compilation flags, library linking issues, or environmental misconfigurations, which obscures true model performance. This work presents the first systematic identification and categorization of such model-agnostic “pseudo-failures.” Analyzing 6,164 code translation samples generated by GPT-4o, DeepSeek-Coder, and Magicoder across five languages (C, C++, Java, Python, and Go) on the Avatar, CodeNet, and EvalPlus benchmarks, the study quantifies the impact of evaluation configuration flaws on reported results. It finds that a significant portion of alleged failures are caused by environmental errors rather than logical inaccuracies. The findings advocate shifting the focus of code translation evaluation from logical correctness alone toward end-to-end reliability, and call for transparent, configuration-aware evaluation standards.
📝 Abstract
Large Language Models (LLMs) have achieved remarkable success in automated code translation. While prior work has focused on improving translation accuracy through advanced prompting and iterative repair, the reliability of the underlying evaluation frameworks has received less attention. In this paper, we demonstrate that a significant number of reported failures in code translation are due not to incorrect logic but to evaluation-induced errors stemming from improper compilation flags, missing library links, and unconfigured runtime environments. We conduct a large-scale empirical study across five programming languages (C, C++, Java, Python, Go) and three benchmarks (Avatar, CodeNet, EvalPlus), covering 6,164 translations generated by GPT-4o, DeepSeek-Coder, and Magicoder. Our analysis identifies and categorizes common false negatives, distinguishing pipeline-induced failures that affect any model from model-dependent behaviors that vary across LLMs. Our findings highlight the need for transparent, configuration-aware evaluation standards to accurately assess progress in LLM-based code translation.
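To make the idea of an evaluation-induced failure concrete, the following is a minimal sketch of how a harness might flag toolchain output that points to the evaluation setup rather than to the translated program. It is not the paper's method or taxonomy: the function name `classify_failure`, the pattern list, and the category labels are all hypothetical, and the patterns are only a few well-known compiler, linker, and interpreter messages chosen for illustration.

```python
import re

# Hypothetical heuristics (an assumption for illustration, not the paper's
# taxonomy): messages that usually indicate a harness or environment problem.
PIPELINE_PATTERNS = [
    # GNU ld output when a symbol is unresolved, e.g. sqrt without -lm.
    (re.compile(r"undefined reference to"), "possible missing library link"),
    # Older GCC defaults reject C99 syntax unless a -std flag is passed.
    (re.compile(r"only allowed in C99"), "possible missing -std flag"),
    # CPython message when a dependency is not installed in the runtime.
    (re.compile(r"ModuleNotFoundError: No module named"), "missing runtime package"),
    # javac message when the harness saves a public class under the wrong file name.
    (re.compile(r"should be declared in a file named"), "wrong Java file name"),
]


def classify_failure(stderr: str) -> str:
    """Return a coarse label for a failed compile or run, based on its stderr."""
    for pattern, cause in PIPELINE_PATTERNS:
        if pattern.search(stderr):
            return f"pipeline-induced: {cause}"
    # No known environment signature matched; the translation itself needs review.
    return "unclassified: inspect translation logic"


if __name__ == "__main__":
    print(classify_failure("main.c:(.text+0x1c): undefined reference to `sqrt'"))
    print(classify_failure("AssertionError: expected 3, got 4"))
```

A match here is only a hint that the failure may be a false negative. A real pipeline would re-run the translation under a corrected configuration before counting it as a pass or a fail.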