🤖 AI Summary
This work addresses the challenge of maintaining functional consistency and enabling continuous evolution during the cross-language migration of a large-scale, production-grade AI agent system from Rust to Python. The authors propose a large language model–assisted, benchmark-driven migration paradigm that uses public agent benchmarks as optimization targets. Through an iterative diff-translate-test loop and a multi-agent architecture governed by feature flags, they migrate a 648K LOC Rust codebase to a 41K LOC Python implementation, a 15.9× reduction in code volume. The resulting system not only preserves strict behavioral alignment with the original but also surpasses it on SWE-bench Verified (73.8% vs. 70.0%) and introduces 30 new capabilities absent from the original, such as semantic memory and guardian safety, achieving a functional superset rather than a mere language port.
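The summary's combination of feature-flagged extensions with strict behavioral parity can be sketched as follows. This is a minimal illustrative model, not the project's actual code: the flag names, the `parity_mode` constructor, and the `handle_turn` pipeline are all hypothetical stand-ins for how opt-in extensions might wrap a behavior-identical core.

```python
from dataclasses import dataclass


@dataclass
class Flags:
    # Hypothetical extension flags; all default to off.
    semantic_memory: bool = False
    guardian_safety: bool = False

    @classmethod
    def parity_mode(cls) -> "Flags":
        # Strict parity: every extension disabled, so behavior
        # mirrors the original Rust implementation.
        return cls()


def handle_turn(prompt: str, flags: Flags) -> list:
    steps = ["plan", "act"]  # core pipeline shared with the original
    if flags.guardian_safety:
        steps.insert(0, "guardian_check")  # extension: safety screening first
    if flags.semantic_memory:
        steps.append("store_memory")       # extension: write to semantic memory
    return steps


print(handle_turn("fix the bug", Flags.parity_mode()))
print(handle_turn("fix the bug", Flags(guardian_safety=True)))
```

With all flags off the port stays benchmark-comparable against the Rust original; each of the 30 extensions is an additive, independently toggleable layer.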
📝 Abstract
Cross-language migration of large software systems is a persistent engineering challenge, particularly when the source codebase evolves rapidly. We present a methodology for LLM-assisted continuous code translation in which a large language model translates a production Rust codebase (648K LOC, 65 crates) into Python (41K LOC, 28 modules), with public agent benchmarks as the objective function driving iterative refinement. Our subject system is Codex CLI, a production AI coding agent. We demonstrate that: (1) the Python port resolves 59/80 SWE-bench Verified tasks (73.8%) versus Rust's 56/80 (70.0%), and achieves 42.5% on Terminal-Bench versus Rust's 47.5%, confirming near-parity on real-world agentic tasks; (2) benchmark-driven debugging is more effective than static testing alone, surfacing API protocol mismatches, environment pollution, a silent WebSocket failure mode, and an API 400 crash; (3) the architecture supports continuous upstream synchronisation via an LLM-assisted diff-translate-test loop; and (4) the Python port has evolved into a capability superset with 30 feature-flagged extensions (multi-agent orchestration, semantic memory, guardian safety, cost tracking) absent from Rust, while preserving a strict parity mode for comparison. Our evaluation shows that for LLM-based agents, where API latency dominates, Python's expressiveness yields a 15.9× code reduction with negligible performance cost, while the benchmark-as-objective-function methodology provides a principled framework for growing a cross-language port from parity into an extended platform.
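The diff-translate-test loop with a benchmark score as acceptance gate can be sketched as below. Every function here is an illustrative stub under stated assumptions: `translate_hunk` stands in for an LLM call, `run_benchmark` for a SWE-bench-style evaluation, and the rollback-on-regression policy is one plausible reading of "benchmarks as the objective function", not the authors' actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    modules: dict = field(default_factory=dict)  # module name -> Python source


def translate_hunk(rust_hunk: str) -> str:
    """Stand-in for an LLM call that emits the Python equivalent of a Rust hunk."""
    # Toy "translation"; a real system would prompt an LLM with hunk plus context.
    return rust_hunk.replace("fn ", "def ").replace("&str", "str")


def run_benchmark(port: Port) -> float:
    """Stand-in for scoring the port on an agent benchmark (e.g. SWE-bench Verified)."""
    if not port.modules:
        return 0.0
    # Toy score: fraction of modules containing a translated function.
    return sum("def " in src for src in port.modules.values()) / len(port.modules)


def sync_upstream(port: Port, rust_diffs: dict, baseline: float, max_retries: int = 2) -> float:
    """Apply each upstream Rust hunk; keep it only if the benchmark does not regress."""
    for module, hunk in rust_diffs.items():
        for _ in range(max_retries + 1):
            old = port.modules.get(module)
            port.modules[module] = translate_hunk(hunk)
            score = run_benchmark(port)
            if score >= baseline:
                baseline = score
                break
            # Regression: roll back and retry (a real loop would re-prompt
            # the LLM with the failing benchmark trace).
            if old is None:
                del port.modules[module]
            else:
                port.modules[module] = old
    return baseline


port = Port()
diffs = {"exec.py": "fn run(cmd: &str) {}", "memory.py": "fn recall(key: &str) {}"}
final_score = sync_upstream(port, diffs, baseline=0.0)
print(final_score)
```

The key design choice the abstract implies is that the benchmark, not a static diff review, decides whether a translated hunk lands, which is what lets the loop run continuously against a fast-moving upstream.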