🤖 AI Summary
Research on C-to-Rust translation lacks benchmarks that jointly assess memory safety and end-to-end functional correctness, which hinders the modernization of legacy C code. This paper introduces CRUST-Bench, a repository-level benchmark for C-to-safe-Rust translation. It comprises 100 real-world C repositories, each accompanied by manually written, memory-safe Rust interfaces (serving as explicit specifications) and end-to-end test suites that account for cross-file dependencies. The authors describe it as the first benchmark to evaluate this task at the repository level. Combining an evaluation of LLM-based translation with a systematic analysis of error patterns, the experiments show that even the best-performing model (OpenAI o1) succeeds on only 15 of 100 tasks in single-shot generation. The study identifies limitations in current approaches, including inaccurate ownership inference, inadequate lifetime modeling, and insufficient handling of external dependencies, which point to critical gaps in automated translation to safe systems code.
📝 Abstract
C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually written interfaces in safe Rust as well as test cases that can be used to validate the correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The Rust interfaces serve as explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that generating safe and idiomatic Rust remains challenging for current state-of-the-art methods and techniques. We also provide insights into the errors LLMs commonly make when transpiling code from C to safe Rust. The best-performing model, OpenAI o1, solves only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to better transpilation systems that can reason about complex scenarios and help migrate legacy codebases from C to languages like Rust that ensure memory safety. The dataset and code are available at https://github.com/anirudhkhatry/CRUST-bench.
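To make the pairing of a safe Rust interface with test cases concrete, here is a minimal sketch in Rust. It is not taken from CRUST-Bench: the `Stack` type, the `StackError` enum, and every method name are hypothetical, and it assumes a C original that exposes a fixed-capacity integer stack through raw pointers and integer error codes. The interface fixes the signatures a translation must satisfy, and the method bodies show one possible safe implementation.

```rust
/// Hypothetical safe-Rust interface for a C fixed-capacity stack.
/// Assumed C original: `struct stack { int *data; size_t len, cap; }`
/// with `int stack_push(struct stack *s, int v)` returning an error code.
pub struct Stack {
    data: Vec<i32>,  // owned buffer replaces the raw `int *`
    capacity: usize, // fixed limit carried over from the C design
}

/// Replaces the integer error codes of the assumed C API.
#[derive(Debug, PartialEq)]
pub enum StackError {
    Full,
}

impl Stack {
    /// Creates an empty stack that holds at most `capacity` values.
    pub fn new(capacity: usize) -> Self {
        Stack {
            data: Vec::with_capacity(capacity),
            capacity,
        }
    }

    /// Pushes a value, or reports `Full` instead of writing out of bounds.
    pub fn push(&mut self, value: i32) -> Result<(), StackError> {
        if self.data.len() == self.capacity {
            return Err(StackError::Full);
        }
        self.data.push(value);
        Ok(())
    }

    /// Removes and returns the top value; `None` replaces a C sentinel.
    pub fn pop(&mut self) -> Option<i32> {
        self.data.pop()
    }

    /// Borrows the top value without removing it.
    pub fn peek(&self) -> Option<&i32> {
        self.data.last()
    }

    /// Number of values currently stored.
    pub fn len(&self) -> usize {
        self.data.len()
    }

    /// True when no values are stored.
    pub fn is_empty(&self) -> bool {
        self.data.is_empty()
    }
}
```

Under these assumptions, a test case would construct a `Stack`, call `push` and `pop`, and assert on the returned `Result` and `Option` values. Because the signatures use ownership, borrowing, and `Result` rather than raw pointers, a translation that satisfies them cannot fall back on `unsafe` pointer arithmetic at the interface boundary.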