When Code Crosses Borders: A Security-Centric Evaluation of LLM-based Code Translation

📅 2025-09-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing evaluations of large language model (LLM)-based cross-lingual code translation overlook security vulnerabilities, failing to assess how translations preserve or introduce security flaws. Method: We propose STED, the first security-oriented code translation evaluation framework: (1) a benchmark dataset spanning five programming languages and nine high-risk CWE vulnerability classes; (2) a dual-mode evaluation protocol combining human expert review and LLM-as-a-judge to systematically quantify functional correctness, vulnerability retention, and vulnerability introduction. Contribution/Results: We uncover widespread security degradation across mainstream LLMs, particularly for web-related vulnerabilities such as input validation flaws. Furthermore, we introduce a RAG-based mitigation strategy that reduces the new-vulnerability introduction rate by 32.8% (from 28.6%-45% to 19.2%-30.2%) across 6,000 translation instances, significantly enhancing translation security.

๐Ÿ“ Abstract
With the growing demand for cross-language codebase migration, evaluating LLMs' security implications in translation tasks has become critical. Existing evaluations primarily focus on syntactic or functional correctness at the function level, neglecting the critical dimension of security. To enable security evaluation, we construct STED (Security-centric Translation Evaluation Dataset), the first dataset specifically designed for evaluating the security implications of LLM-based code translation. It comprises 720 security-related code samples across five programming languages and nine high-impact CWE categories, sourced from CVE/NVD and manually verified for translation tasks. Our evaluation framework consists of two independent assessment modules: (1) rigorous evaluation by security researchers, and (2) automated analysis via LLM-as-a-judge. Together they evaluate three critical aspects: functional correctness, vulnerability preservation, and vulnerability introduction rates. Our large-scale evaluation of five state-of-the-art LLMs across 6,000 translation instances reveals significant security degradation, with 28.6%-45% of translations introducing new vulnerabilities, particularly for web-related flaws like input validation, where LLMs show consistent weaknesses. Furthermore, we develop a Retrieval-Augmented Generation (RAG)-based mitigation strategy that reduces translation-induced vulnerabilities by 32.8%, showing the potential of knowledge-enhanced prompting.
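The three aspects the abstract names can be aggregated per evaluation run. The sketch below is a hypothetical illustration (the field names and schema are not from the paper): given per-sample judgments, it computes functional correctness over all samples, vulnerability preservation over the samples whose source carried a known flaw, and the new-vulnerability introduction rate over all samples.

```python
from dataclasses import dataclass

@dataclass
class TranslationResult:
    """Judgments for one translated sample. Field names are illustrative,
    not the paper's actual schema."""
    functionally_correct: bool
    source_vulnerable: bool      # source sample carried a known CWE flaw
    flaw_preserved: bool         # that flaw survives in the translation
    new_flaw_introduced: bool    # translation adds a flaw absent in the source

def summarize(results):
    """Aggregate the three rates the evaluation protocol reports."""
    n = len(results)
    vulnerable = [r for r in results if r.source_vulnerable]
    return {
        "functional_correctness": sum(r.functionally_correct for r in results) / n,
        "vulnerability_preservation": (
            sum(r.flaw_preserved for r in vulnerable) / len(vulnerable)
            if vulnerable else 0.0
        ),
        "vulnerability_introduction": sum(r.new_flaw_introduced for r in results) / n,
    }
```

Note that preservation is conditioned on the source being vulnerable, while introduction is measured over all translations; the two denominators differ by design.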
Problem

Research questions and friction points this paper is trying to address.

Evaluating security implications in LLM-based code translation
Assessing vulnerability introduction and preservation rates
Developing mitigation strategies for translation-induced security degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed STED dataset for security evaluation
Developed dual assessment framework with human and automated analysis
Implemented RAG-based mitigation to reduce vulnerabilities
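The paper does not publish its prompt template, so the following is a hypothetical sketch of what a RAG-based mitigation can look like: a retriever (here a caller-supplied function) fetches security guidance relevant to the source snippet, and the guidance is prepended to the translation instruction as knowledge-enhanced context.

```python
def build_secure_translation_prompt(code, src_lang, tgt_lang, retrieve):
    """Assemble a translation prompt augmented with retrieved security notes.

    `retrieve` is a hypothetical callable returning a list of guidance
    strings (e.g. CWE descriptions or safe-API mappings) for the snippet.
    """
    notes = retrieve(code, tgt_lang)
    guidance = "\n".join(f"- {note}" for note in notes)
    return (
        f"Translate the following {src_lang} code to {tgt_lang}.\n"
        "Preserve the original behavior and follow these security guidelines:\n"
        f"{guidance}\n\n"
        f"Source ({src_lang}):\n{code}"
    )
```

A usage example with a stub retriever:

```python
prompt = build_secure_translation_prompt(
    "strcpy(buf, user_input);", "C", "Python",
    lambda code, lang: ["CWE-120: avoid unbounded copies; validate input length"],
)
```

The design keeps retrieval behind a plain callable so the same prompt builder works whether the guidance comes from a vector store, a CWE lookup table, or a fixed checklist.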
Hailong Chang
Institute of Information Engineering, CAS, School of Cybersecurity, UCAS, Beijing, China
Guozhu Meng
Associate Professor with Chinese Academy of Sciences
mobile security, program analysis, AI privacy and security
Shuhui Xiao
Institute of Information Engineering, CAS, School of Cybersecurity, UCAS, Beijing, China
Kai Chen
Institute of Information Engineering, CAS, School of Cybersecurity, UCAS, Beijing, China
Kun Sun
Institute of Information Engineering, CAS, School of Cybersecurity, UCAS, Beijing, China
Yilin Li
University of Washington
conjugated polymers, luminescent solar concentrators