🤖 AI Summary
Existing evaluations of large language model (LLM)-based cross-lingual code translation overlook security vulnerabilities, failing to assess how translations preserve or introduce security flaws. Method: We propose STED, the first security-oriented code translation evaluation framework: (1) a benchmark dataset spanning five programming languages and nine high-risk CWE vulnerability classes; (2) a dual-mode evaluation protocol combining human expert review and LLM-as-a-judge to systematically quantify functional correctness, vulnerability retention, and vulnerability introduction. Contribution/Results: We uncover widespread security degradation across mainstream LLMs, particularly for web-related vulnerabilities such as input validation flaws. Furthermore, we introduce a RAG-based mitigation strategy that reduces the new-vulnerability introduction rate by 32.8% (from 28.6–45% to 19.2–30.2%) across 6,000 translation instances, significantly enhancing translation security.
📝 Abstract
With the growing demand for cross-language codebase migration, evaluating the security implications of LLM-based translation has become critical. Existing evaluations primarily focus on syntactic or functional correctness at the function level, neglecting the security dimension.
To enable security evaluation, we construct STED (Security-centric Translation Evaluation Dataset), the first dataset specifically designed for evaluating the security implications of LLM-based code translation. It comprises 720 security-related code samples across five programming languages and nine high-impact CWE categories, sourced from CVE/NVD records and manually verified for use in translation tasks. Our evaluation framework consists of two independent assessment modules: (1) rigorous evaluation by security researchers, and (2) automated analysis via LLM-as-a-judge. Together they evaluate three critical aspects: functional correctness, vulnerability preservation, and vulnerability introduction rates.
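As a rough illustration of how the three per-instance verdicts could roll up into the reported rates, the sketch below aggregates judge outputs into dataset-level metrics. This is a minimal sketch under our own assumptions (the `Verdict` record and field names are hypothetical, not the paper's code):

```python
# Hypothetical sketch: aggregating per-translation security verdicts
# (from human reviewers or an LLM-as-a-judge) into dataset-level rates.
from dataclasses import dataclass

@dataclass
class Verdict:
    functionally_correct: bool  # translation preserves intended behavior
    vuln_preserved: bool        # the original CWE flaw survives translation
    vuln_introduced: bool       # a new flaw appears in the translated code

def aggregate(verdicts: list[Verdict]) -> dict[str, float]:
    """Compute the three evaluation rates over all translation instances."""
    n = len(verdicts)
    return {
        "functional_correctness": sum(v.functionally_correct for v in verdicts) / n,
        "vulnerability_preservation": sum(v.vuln_preserved for v in verdicts) / n,
        "vulnerability_introduction": sum(v.vuln_introduced for v in verdicts) / n,
    }

rates = aggregate([
    Verdict(True, True, False),
    Verdict(True, False, True),
    Verdict(False, True, True),
    Verdict(True, True, False),
])
# e.g. rates["vulnerability_introduction"] == 0.5
```

In practice the two assessment modules would each emit such verdicts independently, allowing agreement between human review and the LLM judge to be measured on the same instances.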
Our large-scale evaluation of five state-of-the-art LLMs across 6,000 translation instances reveals significant security degradation, with 28.6–45% of translations introducing new vulnerabilities, particularly for web-related flaws like input validation, where LLMs show consistent weaknesses. Furthermore, we develop a Retrieval-Augmented Generation (RAG)-based mitigation strategy that reduces translation-induced vulnerabilities by 32.8%, showing the potential of knowledge-enhanced prompting.
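The RAG-based mitigation described above can be sketched as retrieving CWE-specific secure-coding guidance and injecting it into the translation prompt. This is a hedged illustration only: the knowledge base, lookup, and prompt template below are our own stand-ins (a real system would retrieve from an embedded security-document store), not the paper's implementation.

```python
# Illustrative knowledge base mapping CWE IDs to secure-coding guidance.
# Entries and wording are hypothetical examples, not from the paper.
CWE_GUIDANCE = {
    "CWE-20": ("Validate and canonicalize all external input before use; "
               "reject values outside the expected range or format."),
    "CWE-89": ("Use parameterized queries / prepared statements; never "
               "concatenate user input into SQL strings."),
}

def retrieve_guidance(cwe_id: str) -> str:
    """Stand-in retriever: direct lookup by CWE label.

    A real RAG pipeline would embed the source snippet and retrieve the
    nearest security documents from a vector store instead.
    """
    return CWE_GUIDANCE.get(cwe_id, "")

def build_translation_prompt(source_code: str, src_lang: str,
                             dst_lang: str, cwe_id: str) -> str:
    """Assemble a knowledge-enhanced translation prompt."""
    guidance = retrieve_guidance(cwe_id)
    return (
        f"Translate the following {src_lang} code to {dst_lang}.\n"
        f"Preserve functionality and do not weaken its security posture.\n"
        f"Relevant secure-coding guidance ({cwe_id}): {guidance}\n\n"
        f"```{src_lang.lower()}\n{source_code}\n```"
    )

prompt = build_translation_prompt(
    'query = "SELECT * FROM users WHERE id = " + user_id',
    "Python", "Java", "CWE-89",
)
```

The prompt now carries the retrieved guidance alongside the code, which is the mechanism by which knowledge-enhanced prompting can steer the model away from reintroducing or creating the flagged vulnerability class.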