Bootstrapping Code Translation with Weighted Multilanguage Exploration

📅 2026-01-07
🏛️ arXiv.org
🤖 AI Summary
This work addresses two constraints on multilingual code translation: the scarcity of parallel data and imbalanced optimization across language pairs. The authors propose BootTrans, which uses test suites as verification oracles to enforce cross-lingual functional equivalence. BootTrans employs a dual-pool bootstrapping architecture, comprising a seed pool and an exploration pool, to iteratively expand high-quality training data under execution guidance. It also introduces a language-aware dynamic weighting strategy that adaptively shifts training emphasis across language pairs. On the HumanEval-X and TransCoder-Test benchmarks, BootTrans substantially outperforms existing large language model baselines across all translation directions.
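The dual-pool idea can be pictured with a small sketch. This is only one plausible reading of the summary, not the authors' implementation: the names `passes`, `bootstrap`, and `toy_propose` are hypothetical, translated programs are stood in for by plain Python callables, and the promotion rule (move a task to the seed pool once a candidate passes all of its tests) is an assumption.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a "program" is a Python callable in place of
# translated source code; a test suite is a list of (input, expected) pairs.
Program = Callable[[int], int]
TestSuite = List[Tuple[int, int]]


def passes(program: Program, tests: TestSuite) -> bool:
    """Execution oracle: accept a candidate only if every test passes."""
    try:
        return all(program(x) == expected for x, expected in tests)
    except Exception:
        # A crashing candidate counts as a failed translation.
        return False


def bootstrap(seed_pool: Dict[str, Program],
              exploration_pool: Dict[str, TestSuite],
              propose: Callable[[str, int], Program],
              rounds: int) -> Dict[str, Program]:
    """Assumed loop: verified candidates move from exploration to seed."""
    for round_idx in range(rounds):
        # Copy the keys so tasks can be removed while iterating.
        for task_id in list(exploration_pool):
            candidate = propose(task_id, round_idx)
            if passes(candidate, exploration_pool[task_id]):
                seed_pool[task_id] = candidate
                del exploration_pool[task_id]
    return seed_pool


def toy_propose(task_id: str, round_idx: int) -> Program:
    """Toy translator: wrong in round 0, correct from round 1 onward."""
    if round_idx == 0:
        return lambda x: x + 1
    return lambda x: 2 * x


seed: Dict[str, Program] = {}
explore: Dict[str, TestSuite] = {"double": [(1, 2), (3, 6)]}
bootstrap(seed, explore, toy_propose, rounds=2)
```

In this toy run the round-0 candidate passes the first test by coincidence but fails the second, so the task stays in the exploration pool until the round-1 candidate passes the whole suite.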

📝 Abstract
Code translation across multiple programming languages is essential yet challenging because of two key obstacles: scarcity of parallel data paired with executable test oracles, and optimization imbalance when handling diverse language pairs. We propose BootTrans, a bootstrapping method that resolves both obstacles. Its key idea is to leverage the functional invariance and cross-lingual portability of test suites, adapting abundant pivot-language unit tests to serve as universal verification oracles for multilingual RL training. Our method introduces a dual-pool architecture with seed and exploration pools to progressively expand training data via execution-guided experience collection. Furthermore, we design a language-aware weighting mechanism that dynamically prioritizes harder translation directions based on relative performance across sibling languages, mitigating optimization imbalance. Extensive experiments on the HumanEval-X and TransCoder-Test benchmarks demonstrate substantial improvements over baseline LLMs across all translation directions, with ablations validating the effectiveness of both bootstrapping and weighting components.
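To make the weighting idea concrete, here is a minimal sketch. The abstract does not give the formula, so everything here is an assumption: the function name `language_weights`, the softmax over the gap to the sibling-language mean, the `temperature` parameter, and the normalization to an average weight of 1 are all hypothetical choices that merely satisfy "harder directions get more weight".

```python
import math
from typing import Dict


def language_weights(pass_rates: Dict[str, float],
                     temperature: float = 0.5) -> Dict[str, float]:
    """Assumed weighting: directions below the sibling mean get weight > 1.

    pass_rates maps each target language to its current test pass rate
    for one source language (the "sibling" group).
    """
    mean_rate = sum(pass_rates.values()) / len(pass_rates)
    # Softmax over the gap to the sibling mean: lower pass rate, higher score.
    scores = {
        lang: math.exp((mean_rate - rate) / temperature)
        for lang, rate in pass_rates.items()
    }
    total = sum(scores.values())
    n = len(pass_rates)
    # Rescale so the weights average to 1 and the overall loss scale is kept.
    return {lang: n * score / total for lang, score in scores.items()}


weights = language_weights({"java": 0.8, "cpp": 0.5, "go": 0.2})
```

With these made-up pass rates, the hardest direction (`go`) receives the largest weight and the easiest (`java`) the smallest. A lower `temperature` sharpens that contrast, while a higher one moves all weights toward 1.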
Problem

Research questions and friction points this paper is trying to address.

code translation
parallel data scarcity
optimization imbalance
multilingual programming
test oracles
Innovation

Methods, ideas, or system contributions that make the work stand out.

bootstrapping
multilingual code translation
test oracle adaptation
language-aware weighting
execution-guided data expansion
Yuhan Wu
Peking University, Ph.D. student in CS, yuhan.wu [at] pku.edu.cn. My Chinese name is 吴钰晗.
Data Structures, Networking, Big Data
Huan Zhang
State Key Laboratory for Novel Software Technology, Nanjing University, China
Wei Cheng
State Key Laboratory for Novel Software Technology, Nanjing University, China
Chen Shen
State Key Laboratory for Novel Software Technology, Nanjing University, China
Jingyue Yang
State Key Laboratory for Novel Software Technology, Nanjing University, China
Wei Hu
Nanjing University
Knowledge Graph, Database, NLP, Digital Health