🤖 AI Summary
This work systematically evaluates large language models (LLMs) on migrating real-world open-source code to Rust. To address the lack of rigorous I/O-equivalence validation and automated repair in prior studies, we introduce FLOURINE, a toolchain that uses differential fuzz testing to automatically verify that a translation is equivalent to its source and applies a counterexample-driven, iterative LLM repair mechanism, eliminating the need for manually crafted test cases. Benchmarking five models (GPT-4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral), our approach achieves an end-to-end translation success rate of up to 47%. We identify, for the first time, critical bottlenecks that keep current LLMs from practical code translation, including state-consistency violations and inaccurate modeling of resource lifetimes, and we establish an extensible methodology and an empirically grounded benchmark for improving translation robustness and reliability.
📝 Abstract
Large language models (LLMs) show promise in code translation, the task of translating code written in one programming language into another, due to their ability to write code in most programming languages. However, the effectiveness of LLMs at translating real-world code remains largely unstudied. In this work, we perform the first substantial study of LLM-based translation to Rust by assessing five state-of-the-art LLMs: GPT-4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. We conduct our study on code extracted from real-world open-source projects. To enable the study, we develop FLOURINE, an end-to-end code translation tool that uses differential fuzzing to check whether a Rust translation is I/O equivalent to the original source program, eliminating the need for pre-existing test cases. As part of our investigation, we assess both the LLMs' ability to produce an initially successful translation and their capacity to fix a previously generated buggy one. If the original and translated programs are not I/O equivalent, we apply a set of automated feedback strategies, including feeding counterexamples back to the LLM. Our results show that the most successful LLM can translate 47% of our benchmarks; they also provide insights into next steps for improvement.
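The differential-fuzzing equivalence check described above can be pictured with a minimal sketch. Everything here is illustrative, not FLOURINE's actual harness: `original_impl` and `translated_impl` stand in for the source program and its candidate Rust translation (the real tool drives code extracted from whole projects), and the xorshift generator and iteration budget are arbitrary assumptions.

```rust
// Minimal sketch of a differential fuzz check for I/O equivalence.
// `original_impl` / `translated_impl` are hypothetical stand-ins for the
// source program and the LLM-generated Rust translation.

fn original_impl(xs: &[i64]) -> i64 {
    xs.iter().sum()
}

fn translated_impl(xs: &[i64]) -> i64 {
    // A buggy translation would diverge here; identical for illustration.
    xs.iter().fold(0, |acc, x| acc + x)
}

/// Tiny deterministic PRNG (xorshift) so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Run both implementations on random inputs; return the first
/// counterexample (input plus both outputs) if they ever disagree.
fn differential_fuzz(iters: usize) -> Option<(Vec<i64>, i64, i64)> {
    let mut rng = XorShift(0x9E3779B97F4A7C15);
    for _ in 0..iters {
        let len = (rng.next() % 16) as usize;
        let input: Vec<i64> = (0..len).map(|_| rng.next() as i64 % 1000).collect();
        let (a, b) = (original_impl(&input), translated_impl(&input));
        if a != b {
            return Some((input, a, b)); // I/O inequivalence witnessed
        }
    }
    None // no divergence found within the fuzzing budget
}

fn main() {
    match differential_fuzz(10_000) {
        Some((input, expected, got)) =>
            println!("counterexample: {:?} -> expected {}, got {}", input, expected, got),
        None => println!("no divergence found; translation accepted"),
    }
}
```

Because the fuzzer generates inputs itself, no pre-existing test suite is needed; equivalence is only established up to the fuzzing budget, which is the usual trade-off of this technique.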
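The counterexample-feedback strategy can likewise be sketched as a repair loop. Again this is a hedged illustration: `query_llm` and `fuzz_candidate` are hypothetical placeholders, and FLOURINE's real prompts, model APIs, and fuzzing backend are not reproduced here.

```rust
// Hypothetical counterexample-driven repair loop: when fuzzing finds a
// divergence, the failing input and both outputs are folded back into a
// prompt and the model is asked for a corrected translation.

struct Counterexample {
    input: Vec<i64>,
    expected: i64,
    got: i64,
}

// Placeholder for a model call; a real harness would hit an LLM API here.
fn query_llm(prompt: &str) -> String {
    format!("// candidate translation for prompt:\n// {prompt}")
}

// Placeholder for the differential fuzzer from the previous sketch,
// applied to a freshly generated candidate translation.
fn fuzz_candidate(_candidate: &str) -> Option<Counterexample> {
    None // pretend the candidate survived the fuzzing budget
}

/// Ask for a translation, then iteratively feed fuzzing counterexamples
/// back to the model until it passes or the attempt budget runs out.
fn repair_loop(source: &str, max_attempts: usize) -> Option<String> {
    let mut candidate = query_llm(&format!("Translate this to Rust:\n{source}"));
    for _ in 0..max_attempts {
        match fuzz_candidate(&candidate) {
            None => return Some(candidate), // I/O-equivalent within budget
            Some(cex) => {
                candidate = query_llm(&format!(
                    "On input {:?} the original returns {} but your Rust \
                     returns {}. Fix the translation:\n{}",
                    cex.input, cex.expected, cex.got, candidate
                ));
            }
        }
    }
    None // unrepaired after the budget; counted as a failed translation
}

fn main() {
    if let Some(rust_code) = repair_loop("int sum(int *a, int n);", 5) {
        println!("accepted translation:\n{rust_code}");
    }
}
```

The loop captures the paper's two measured abilities in one place: the first `query_llm` call corresponds to producing an initially successful translation, and each subsequent iteration corresponds to repairing a buggy one from a concrete counterexample.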