🤖 AI Summary
This work addresses the challenge of maintaining functional consistency and enabling continuous evolution during the cross-language migration of a large-scale, production-grade AI agent system from Rust to Python. The authors propose a large language model–assisted, benchmark-driven migration paradigm that uses public agent benchmarks as optimization targets. Through an iterative diff-translate-test loop and a multi-agent architecture governed by feature flags, they migrate a 648K LOC Rust codebase to a 41K LOC Python implementation, a 15.9× reduction in code volume. The resulting system not only preserves strict behavioral alignment with the original but also surpasses it on SWE-bench Verified (73.8% vs. 70.0%) and introduces 30 new capabilities absent from the original, such as semantic memory and guardian safety, achieving a functional superset rather than a mere language port.
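The summary's combination of feature-flagged extensions with strict behavioral parity can be sketched as follows. This is a minimal illustrative model, not the project's actual code: the flag names, the `parity_mode` constructor, and the `handle_turn` pipeline are all hypothetical stand-ins for how opt-in extensions might wrap a behavior-identical core.

```python
from dataclasses import dataclass


@dataclass
class Flags:
    # Hypothetical extension flags; all default to off.
    semantic_memory: bool = False
    guardian_safety: bool = False

    @classmethod
    def parity_mode(cls) -> "Flags":
        # Strict parity: every extension disabled, so behavior
        # mirrors the original Rust implementation.
        return cls()


def handle_turn(prompt: str, flags: Flags) -> list:
    steps = ["plan", "act"]  # core pipeline shared with the original
    if flags.guardian_safety:
        steps.insert(0, "guardian_check")  # extension: safety screening first
    if flags.semantic_memory:
        steps.append("store_memory")       # extension: write to semantic memory
    return steps


print(handle_turn("fix the bug", Flags.parity_mode()))
print(handle_turn("fix the bug", Flags(guardian_safety=True)))
```

With all flags off the port stays benchmark-comparable against the Rust original; each of the 30 extensions is an additive, independently toggleable layer.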
📝 Abstract
Cross-language migration of large software systems is a persistent engineering challenge, particularly when the source codebase evolves rapidly. We present a methodology for LLM-assisted continuous code translation in which a large language model translates a production Rust codebase (648K LOC, 65 crates) into Python (41K LOC, 28 modules), with public agent benchmarks as the objective function driving iterative refinement. Our subject system is Codex CLI, a production AI coding agent. We demonstrate that: (1) the Python port resolves 59/80 SWE-bench Verified tasks (73.8%) versus Rust's 56/80 (70.0%), and achieves 42.5% on Terminal-Bench versus Rust's 47.5%, confirming near-parity on real-world agentic tasks; (2) benchmark-driven debugging is more effective than static testing alone, surfacing API protocol mismatches, environment pollution, a silent WebSocket failure mode, and an API 400 crash; (3) the architecture supports continuous upstream synchronisation via an LLM-assisted diff-translate-test loop; and (4) the Python port has evolved into a capability superset with 30 feature-flagged extensions (multi-agent orchestration, semantic memory, guardian safety, cost tracking) absent from Rust, while preserving a strict parity mode for comparison. Our evaluation shows that for LLM-based agents, where API latency dominates, Python's expressiveness yields a 15.9× code reduction with negligible performance cost, while the benchmark-as-objective-function methodology provides a principled framework for growing a cross-language port from parity into an extended platform.
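The diff-translate-test loop with a benchmark score as acceptance gate can be sketched as below. Every function here is an illustrative stub under stated assumptions: `translate_hunk` stands in for an LLM call, `run_benchmark` for a SWE-bench-style evaluation, and the rollback-on-regression policy is one plausible reading of "benchmarks as the objective function", not the authors' actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    modules: dict = field(default_factory=dict)  # module name -> Python source


def translate_hunk(rust_hunk: str) -> str:
    """Stand-in for an LLM call that emits the Python equivalent of a Rust hunk."""
    # Toy "translation"; a real system would prompt an LLM with hunk plus context.
    return rust_hunk.replace("fn ", "def ").replace("&str", "str")


def run_benchmark(port: Port) -> float:
    """Stand-in for scoring the port on an agent benchmark (e.g. SWE-bench Verified)."""
    if not port.modules:
        return 0.0
    # Toy score: fraction of modules containing a translated function.
    return sum("def " in src for src in port.modules.values()) / len(port.modules)


def sync_upstream(port: Port, rust_diffs: dict, baseline: float, max_retries: int = 2) -> float:
    """Apply each upstream Rust hunk; keep it only if the benchmark does not regress."""
    for module, hunk in rust_diffs.items():
        for _ in range(max_retries + 1):
            old = port.modules.get(module)
            port.modules[module] = translate_hunk(hunk)
            score = run_benchmark(port)
            if score >= baseline:
                baseline = score
                break
            # Regression: roll back and retry (a real loop would re-prompt
            # the LLM with the failing benchmark trace).
            if old is None:
                del port.modules[module]
            else:
                port.modules[module] = old
    return baseline


port = Port()
diffs = {"exec.py": "fn run(cmd: &str) {}", "memory.py": "fn recall(key: &str) {}"}
final_score = sync_upstream(port, diffs, baseline=0.0)
print(final_score)
```

The key design choice the abstract implies is that the benchmark, not a static diff review, decides whether a translated hunk lands, which is what lets the loop run continuously against a fast-moving upstream.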