Towards Translating Real-World Code with LLMs: A Study of Translating to Rust

📅 2024-05-19
🏛️ arXiv.org
📈 Citations: 14
Influential: 1
🤖 AI Summary
This work systematically evaluates large language models (LLMs) on translating real-world open-source code to Rust. Addressing the lack of rigorous I/O equivalence validation and automated repair in prior studies, the authors introduce FLOURINE, a toolchain that uses differential fuzz testing to automatically verify translation equivalence and a counterexample-driven, iterative LLM repair loop, eliminating the need for manually crafted test cases. Benchmarking multiple models (GPT-4, Claude 3, Claude 2.1, Gemini Pro, Mixtral), the approach achieves an end-to-end translation success rate of up to 47%. The study identifies critical bottlenecks in current LLMs for practical code translation, including state-consistency violations and inaccurate modeling of resource lifetimes, and contributes an extensible methodology and an empirically grounded benchmark for improving translation robustness and reliability.

📝 Abstract
Large language models (LLMs) show promise in code translation - the task of translating code written in one programming language to another - due to their ability to write code in most programming languages. However, LLMs' effectiveness at translating real-world code remains largely unstudied. In this work, we perform the first substantial study of LLM-based translation to Rust by assessing five state-of-the-art LLMs: GPT-4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. We conduct our study on code extracted from real-world open source projects. To enable our study, we develop FLOURINE, an end-to-end code translation tool that uses differential fuzzing to check whether a Rust translation is I/O equivalent to the original source program, eliminating the need for pre-existing test cases. As part of our investigation, we assess both the LLMs' ability to produce an initially successful translation and their capacity to fix a previously generated buggy one. If the original and translated programs are not I/O equivalent, we apply a set of automated feedback strategies, including feedback to the LLM with counterexamples. Our results show that the most successful LLM can translate 47% of our benchmarks, and they also provide insights into next steps for improvement.
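The differential fuzzing idea in the abstract can be illustrated with a short, language-agnostic sketch. FLOURINE itself compiles and fuzzes the original program and its Rust translation; here, as a simplification, both sides are stand-in Python functions and the function names (`find_counterexample`, `source_fn`, `translated_fn`) are hypothetical, not part of the paper's tooling:

```python
import random
from typing import Callable, Optional

def find_counterexample(ref: Callable[[int], int],
                        cand: Callable[[int], int],
                        trials: int = 1000,
                        seed: int = 0) -> Optional[int]:
    """Differential fuzzing: feed random inputs to both functions and
    report the first input on which their outputs diverge."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-1_000, 1_000)
        if ref(x) != cand(x):
            return x  # counterexample: translation is not I/O equivalent
    return None  # no divergence found within the fuzzing budget

# Toy stand-ins: the "source" returns |x|; the "translation" forgot
# the sign handling, so any negative input is a counterexample.
source_fn = abs
translated_fn = lambda x: x

cx = find_counterexample(source_fn, translated_fn)
```

Because equivalence is judged purely on input/output behavior, no pre-existing test suite is needed; a found counterexample doubles as concrete feedback for repair.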
Problem

Research questions and friction points this paper is trying to address.

LLMs' effectiveness at translating real-world code to Rust is largely unstudied
Verifying I/O equivalence of translations without pre-existing test cases (FLOURINE)
Measuring LLMs' initial translation success and their ability to repair buggy translations
Innovation

Methods, ideas, or system contributions that make the work stand out.

First substantial study of LLM-based translation of real-world code to Rust
FLOURINE: an end-to-end translation tool using differential fuzzing for I/O equivalence checking
Automated counterexample-driven feedback to the LLM for repairing buggy translations
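The counterexample-driven feedback idea above can be sketched as a small repair loop. This is a schematic, not FLOURINE's implementation: the `repair` callback stands in for an LLM re-prompt, and all names here are hypothetical:

```python
import random

def find_counterexample(ref, cand, trials=500, seed=1):
    """Minimal differential check: first random input where outputs differ."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-100, 100)
        if ref(x) != cand(x):
            return x
    return None

def repair_loop(ref, cand, repair, max_attempts=5):
    """Counterexample-driven repair: while fuzzing finds a diverging input,
    hand it (with the expected output) to a repair oracle and retry."""
    for _ in range(max_attempts):
        cx = find_counterexample(ref, cand)
        if cx is None:
            return cand  # no divergence found: accept the translation
        cand = repair(cand, cx, ref(cx))  # in FLOURINE, an LLM re-prompt
    return None  # give up after exhausting the attempt budget

# Toy demo: the "repair" step simply swaps in the correct function,
# standing in for an LLM that patches the code given the counterexample.
buggy = lambda x: x  # wrong for negative inputs
fixed = repair_loop(abs, buggy, lambda cand, cx, want: abs)
```

The loop terminates either when fuzzing can no longer distinguish the two programs or when the repair budget is exhausted; the paper's 47% success rate is measured end to end, including such repair rounds.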