Converted, Not Equivalent: Benchmarking Codebase Conversion via Observational Equivalence

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to codebase conversion rely on shallow validation: converted code can pass surface checks yet violate the semantic contracts users actually care about. This work introduces T2J-Bench, a benchmark that reframes codebase conversion as transfer under a fixed equivalence contract, together with a three-stage verifier grounded in observational equivalence. The verifier evaluates, in order, interface compliance, numerical consistency (outputs, losses, and gradients), and short training dynamics under fixed random seeds. Across 355 blind conversion attempts, the best-performing system reached an overall pass rate of only 26.7–28.9%, while every system's self-assessed success rate exceeded the verifier's by 66.6–97.8 percentage points. The findings attribute this overestimation to a mismatch between agents' self-validation routines and the underlying semantic contracts.
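The overestimation figure is a difference between two rates measured over the same attempts, expressed in percentage points. A minimal sketch of that arithmetic follows; the function names and the 20-attempt data are illustrative assumptions, not values or code from the paper.

```python
def pass_rate(flags):
    """Percentage of True entries in a list of booleans."""
    return 100.0 * sum(flags) / len(flags)


def overestimation_gap(self_reported, verified):
    """Self-reported success rate minus verifier pass rate, in percentage points."""
    if len(self_reported) != len(verified):
        raise ValueError("both lists must cover the same attempts")
    return pass_rate(self_reported) - pass_rate(verified)


# Illustrative data only (NOT from the paper): 20 conversion attempts.
self_reported = [True] * 18 + [False] * 2   # agent declares success on 18
verified = [True] * 5 + [False] * 15        # fixed verifier accepts 5

gap = overestimation_gap(self_reported, verified)
print(f"self-reported: {pass_rate(self_reported):.1f}%")
print(f"verified:      {pass_rate(verified):.1f}%")
print(f"gap:           {gap:.1f} points")
```

Reporting the gap in points rather than as a ratio keeps it comparable across systems whose verified pass rates are close to zero.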

📝 Abstract

Coding agents increasingly act as codebase-scale collaborators that can assist with codebase conversion, but this progress has exposed a critical weakness: agents often over-trust their own local validation routines and declare success on artifacts that satisfy surface checks while violating the semantic contracts users actually care about. This problem is especially acute in codebase conversion, where prior evaluation is largely outcome-driven and therefore unstable: two implementations can match on a shallow outcome, such as a single forward loss, while diverging in gradients, optimizer behavior, or short-horizon training dynamics. We introduce T2J-Bench, a benchmark for codebase conversion that reformulates conversion as transfer under a fixed equivalence contract. A fixed verifier then compares source and converted codebases through three ordered stages: Spec (interface admissibility), Numeric (forward outputs, losses, gradients, and objective-specific tensors), and Behavioral (short training dynamics under fixed seeds). Across 355 blind conversion attempts, the best system reaches only 26.7–28.9% overall pass rate despite Spec pass rates up to 91.1%; a 4.7x token-budget spread yields only a 2.2x pass-rate spread; and all systems overestimate success by 66.6–97.8 points relative to the fixed evaluator. This suggests that failures stem more from contract-misaligned self-validation than from limited budget or backbone strength.
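The ordered stages can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration: the two scalar loss functions, the finite-difference gradient, the tolerances, and all function names. It is not T2J-Bench's verifier. It only shows how a conversion can match the source on a single forward loss yet fail once gradients are compared.

```python
import inspect
import random


def source_loss(w, x):
    """Toy 'source' objective: squared error."""
    return (w * x - 1.0) ** 2


def converted_loss(w, x):
    """Toy 'converted' objective: agrees with source_loss at w=3, x=1 only."""
    return 2.0 * abs(w * x - 1.0)


def grad(f, w, x, eps=1e-6):
    """Central finite-difference gradient of f with respect to w."""
    return (f(w + eps, x) - f(w - eps, x)) / (2 * eps)


def spec_stage(src, dst):
    """Spec: both callables expose the same parameter names in the same order."""
    return list(inspect.signature(src).parameters) == list(
        inspect.signature(dst).parameters
    )


def numeric_stage(src, dst, probes, tol=1e-4):
    """Numeric: forward losses and gradients agree at every probe point."""
    for w, x in probes:
        if abs(src(w, x) - dst(w, x)) > tol:
            return False
        if abs(grad(src, w, x) - grad(dst, w, x)) > tol:
            return False
    return True


def train(f, seed, steps=5, lr=0.05):
    """Run a few SGD steps from a seeded start and record the loss curve."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)
    losses = []
    for _ in range(steps):
        x = rng.uniform(0.5, 1.5)
        losses.append(f(w, x))
        w -= lr * grad(f, w, x)
    return losses


def behavioral_stage(src, dst, seed=0, tol=1e-4):
    """Behavioral: short training curves agree under the same seed."""
    return all(
        abs(a - b) <= tol for a, b in zip(train(src, seed), train(dst, seed))
    )


def verify(src, dst, probes):
    """Run the stages in order; return the first failing stage, or None."""
    stages = [
        ("Spec", lambda: spec_stage(src, dst)),
        ("Numeric", lambda: numeric_stage(src, dst, probes)),
        ("Behavioral", lambda: behavioral_stage(src, dst)),
    ]
    for name, check in stages:
        if not check():
            return name
    return None


probes = [(3.0, 1.0)]
loss_only_match = source_loss(3.0, 1.0) == converted_loss(3.0, 1.0)
first_failure = verify(source_loss, converted_loss, probes)
print("single forward loss matches:", loss_only_match)
print("first failing stage:", first_failure)
```

At the probe point both losses equal 4.0, so a loss-only check would accept the conversion, while the gradients (about 4 versus about 2) make the Numeric stage reject it. Stopping at the first failing stage mirrors the ordered design described in the abstract: later stages are only meaningful once earlier ones hold.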

Problem

Research questions and friction points this paper is trying to address.

codebase conversion

observational equivalence

semantic contracts

validation failure

evaluation benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

observational equivalence

codebase conversion

T2J-Bench