Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation

📅 2024-03-25

🏛️ 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (Forge)

📈 Citations: 22

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) frequently wrap translated code in extraneous natural-language explanations or formatting delimiters, so execution-based metrics such as Computational Accuracy (CA) can underestimate their true performance. Method: We evaluate eleven instruction-tuned LLMs on 3,820 translation pairs across five programming languages (C, C++, Go, Java, and Python) and find that 26.4% to 73.7% of translations require post-processing to extract clean code. To address this, we combine prompt engineering with regular-expression parsing to extract the source code from the model output. Contribution/Results: Our approach achieves an average Code Extraction Success Rate (CSR) of 92.73% across the eleven models, improving evaluation fidelity. This work quantifies how output format affects evaluation results and motivates more reliable benchmarks of LLMs for code translation.

📝 Abstract

Code translation between programming languages is a long-existing and critical task in software engineering, facilitating the modernization of legacy systems, ensuring cross-platform compatibility, and enhancing software performance. With the recent advances in large language models (LLMs) and their applications to code translation, there is an increasing need for comprehensive evaluation of these models. In this study, we empirically analyze the generated outputs of eleven popular instruct-tuned LLMs with parameters ranging from 1B up to 46.7B on 3,820 translation pairs across five languages, including C, C++, Go, Java, and Python. Our analysis found that between 26.4% and 73.7% of code translations produced by our evaluated LLMs necessitate post-processing, as these translations often include a mix of code, quotes, and text rather than being purely source code. Overlooking the output format of these models can inadvertently lead to underestimation of their actual performance. This is particularly evident when evaluating them with execution-based metrics such as Computational Accuracy (CA). Our results demonstrate that a strategic combination of prompt engineering and regular expression can effectively extract the source code from the model generation output. In particular, our method can help eleven selected models achieve an average Code Extraction Success Rate (CSR) of 92.73%. Our findings shed light on and motivate future research to conduct more reliable benchmarks of LLMs for code translation.
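To make the post-processing step concrete, here is a minimal Python sketch of regex-based code extraction. It is not the authors' implementation: the pattern, the function name `extract_code`, and the fallback of treating unfenced output as code are all assumptions chosen for illustration, and the sketch only handles Markdown-style triple-backtick fences.

```python
import re
from typing import Optional

# Hypothetical pattern (not from the paper): a Markdown fence with an
# optional language tag, capturing everything up to the closing fence.
FENCE_RE = re.compile(r"```[^\n`]*\n(.*?)```", re.DOTALL)


def extract_code(generation: str) -> Optional[str]:
    """Return the first fenced code block found in a model generation.

    Assumption: if no fence is present, the whole output is treated as code.
    Returns None when nothing but whitespace is left.
    """
    match = FENCE_RE.search(generation)
    if match:
        return match.group(1).strip()
    stripped = generation.strip()
    return stripped or None


output = (
    "Here is the Java translation:\n"
    "```java\n"
    "int add(int a, int b) { return a + b; }\n"
    "```\n"
    "Hope this helps!"
)
print(extract_code(output))
```

Passing the raw `output` string to a compiler would fail because of the surrounding prose, which is the underestimation effect the abstract describes for execution-based metrics.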

Problem

Research questions and friction points this paper is trying to address.

Evaluations of LLM code translation are skewed by the format of the model output

Non-code elements in the output (explanatory text, quotes) interfere with execution-based metrics such as CA

Source code must be extracted from the model output before models can be evaluated reliably

Innovation

Methods, ideas, or system contributions that make the work stand out.

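Combine prompt engineering with regular expressions to extract code from model output

Mitigate the evaluation bias caused by mixed-format LLM outputs

Achieve an average Code Extraction Success Rate (CSR) of 92.73% across the eleven models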
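As a rough illustration of how a rate like CSR could be computed, the Python sketch below counts the share of model outputs for which extraction produced something. The success criterion is an assumption: this summary does not state how the paper decides that an extraction succeeded, so the hypothetical helper `extraction_success_rate` simply treats any non-empty extracted string as a success.

```python
from typing import Optional, Sequence


def extraction_success_rate(extracted: Sequence[Optional[str]]) -> float:
    """Percentage of outputs whose extraction yielded non-empty code.

    Assumption: "success" means a non-empty string was extracted; the
    paper's actual criterion may be stricter (for example, requiring
    that the extracted code compiles).
    """
    if not extracted:
        raise ValueError("need at least one extraction result")
    successes = sum(1 for code in extracted if code)
    return 100.0 * successes / len(extracted)


# Four hypothetical extraction results: two succeed, two fail.
results = ["int x = 1;", None, "print(1)", ""]
print(extraction_success_rate(results))
```

Averaging this per-model percentage over all evaluated models would give a single summary figure of the kind reported above, again under the stated assumption about what counts as success.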
