🤖 AI Summary
Existing evaluations of large language model (LLM)-based cross-lingual code translation overlook security vulnerabilities, failing to assess how translations preserve or introduce security flaws. Method: We propose STED, the first security-oriented code translation evaluation framework: (1) a benchmark dataset spanning five programming languages and nine high-risk CWE vulnerability classes; (2) a dual-mode evaluation protocol combining human expert review and LLM-as-a-judge to systematically quantify functional correctness, vulnerability retention, and vulnerability introduction. Contribution/Results: We uncover widespread security degradation across mainstream LLMs, particularly for web-related vulnerabilities such as input validation flaws. Furthermore, we introduce a RAG-based mitigation strategy that reduces the new-vulnerability introduction rate by 32.8% (from 28.6–45% to 19.2–30.2%) across 6,000 translation instances, significantly enhancing translation security.
📝 Abstract
With the growing demand for cross-language codebase migration, evaluating the security implications of LLM-based translation has become critical. Existing evaluations primarily focus on syntactic or functional correctness at the function level, neglecting the security dimension.
To enable security evaluation, we construct STED (Security-centric Translation Evaluation Dataset), the first dataset specifically designed for evaluating the security implications of LLM-based code translation. It comprises 720 security-related code samples across five programming languages and nine high-impact CWE categories, sourced from CVE/NVD records and manually verified for use in translation tasks. Our evaluation framework consists of two independent assessment modules: (1) rigorous evaluation by security researchers, and (2) automated analysis via LLM-as-a-judge. Together they evaluate three critical aspects: functional correctness, vulnerability preservation, and vulnerability introduction rates.
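As a rough illustration of how the three per-instance verdicts could roll up into the reported rates, the sketch below aggregates judge outputs into dataset-level metrics. This is a minimal sketch under our own assumptions (the `Verdict` record and field names are hypothetical, not the paper's code):

```python
# Hypothetical sketch: aggregating per-translation security verdicts
# (from human reviewers or an LLM-as-a-judge) into dataset-level rates.
from dataclasses import dataclass

@dataclass
class Verdict:
    functionally_correct: bool  # translation preserves intended behavior
    vuln_preserved: bool        # the original CWE flaw survives translation
    vuln_introduced: bool       # a new flaw appears in the translated code

def aggregate(verdicts: list[Verdict]) -> dict[str, float]:
    """Compute the three evaluation rates over all translation instances."""
    n = len(verdicts)
    return {
        "functional_correctness": sum(v.functionally_correct for v in verdicts) / n,
        "vulnerability_preservation": sum(v.vuln_preserved for v in verdicts) / n,
        "vulnerability_introduction": sum(v.vuln_introduced for v in verdicts) / n,
    }

rates = aggregate([
    Verdict(True, True, False),
    Verdict(True, False, True),
    Verdict(False, True, True),
    Verdict(True, True, False),
])
# e.g. rates["vulnerability_introduction"] == 0.5
```

In practice the two assessment modules would each emit such verdicts independently, allowing agreement between human review and the LLM judge to be measured on the same instances.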
Our large-scale evaluation of five state-of-the-art LLMs across 6,000 translation instances reveals significant security degradation, with 28.6–45% of translations introducing new vulnerabilities, particularly for web-related flaws like input validation, where LLMs show consistent weaknesses. Furthermore, we develop a Retrieval-Augmented Generation (RAG)-based mitigation strategy that reduces translation-induced vulnerabilities by 32.8%, showing the potential of knowledge-enhanced prompting.
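The RAG-based mitigation described above can be sketched as retrieving CWE-specific secure-coding guidance and injecting it into the translation prompt. This is a hedged illustration only: the knowledge base, lookup, and prompt template below are our own stand-ins (a real system would retrieve from an embedded security-document store), not the paper's implementation.

```python
# Illustrative knowledge base mapping CWE IDs to secure-coding guidance.
# Entries and wording are hypothetical examples, not from the paper.
CWE_GUIDANCE = {
    "CWE-20": ("Validate and canonicalize all external input before use; "
               "reject values outside the expected range or format."),
    "CWE-89": ("Use parameterized queries / prepared statements; never "
               "concatenate user input into SQL strings."),
}

def retrieve_guidance(cwe_id: str) -> str:
    """Stand-in retriever: direct lookup by CWE label.

    A real RAG pipeline would embed the source snippet and retrieve the
    nearest security documents from a vector store instead.
    """
    return CWE_GUIDANCE.get(cwe_id, "")

def build_translation_prompt(source_code: str, src_lang: str,
                             dst_lang: str, cwe_id: str) -> str:
    """Assemble a knowledge-enhanced translation prompt."""
    guidance = retrieve_guidance(cwe_id)
    return (
        f"Translate the following {src_lang} code to {dst_lang}.\n"
        f"Preserve functionality and do not weaken its security posture.\n"
        f"Relevant secure-coding guidance ({cwe_id}): {guidance}\n\n"
        f"```{src_lang.lower()}\n{source_code}\n```"
    )

prompt = build_translation_prompt(
    'query = "SELECT * FROM users WHERE id = " + user_id',
    "Python", "Java", "CWE-89",
)
```

The prompt now carries the retrieved guidance alongside the code, which is the mechanism by which knowledge-enhanced prompting can steer the model away from reintroducing or creating the flagged vulnerability class.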