🤖 AI Summary
Existing LLM-based code translation methods suffer from two key limitations: (1) sensitivity to source-language syntax and vocabulary, which causes cross-lingual syntactic confusion; and (2) reliance on function-level parallel data, which lacks statement-level semantic alignment and thus introduces semantic drift. This paper proposes TIT, a tree-structured instruction-tuning framework that integrates AST-driven syntax-agnostic parsing, statement-level fine-grained parallel data augmentation, and a two-stage instruction-tuning mechanism, decoupling surface linguistic features while preserving structural consistency. Experiments demonstrate that TIT significantly improves translation accuracy across multiple mainstream LLMs, achieving 1.22–1.75× higher success rates than state-of-the-art methods and substantially reducing syntactic confusion. To our knowledge, TIT is the first approach to jointly optimize syntactic robustness and semantic fidelity in code translation.
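To make "AST-driven syntax-agnostic parsing" concrete, here is a minimal sketch of the general idea: parse a function into an AST and keep only node categories, discarding surface tokens such as keywords and operators. This uses Python's standard `ast` module for illustration; the paper's actual parser and feature format are not specified here, so the serialization scheme below is our assumption, not TIT's.

```python
import ast

def syntax_skeleton(source: str) -> list[str]:
    """Serialize code as a list of AST node-type names.

    Dropping identifiers, keywords, and operators leaves only the
    structural categories of the program, a representation that
    abstracts away language-specific surface syntax. This is an
    illustrative scheme, not necessarily the one TIT uses.
    """
    tree = ast.parse(source)
    # ast.walk yields nodes breadth-first, starting from the Module root
    return [type(node).__name__ for node in ast.walk(tree)]

skeleton = syntax_skeleton("def add(a, b):\n    return a + b\n")
# Structural categories only, e.g. 'FunctionDef', 'Return', 'BinOp';
# the names 'add', 'a', 'b' do not appear.
```

Because two functions that differ only in identifiers or keyword spelling map to the same skeleton, a representation like this is insensitive to the surface features that cause cross-lingual syntactic confusion.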
📝 Abstract
Large Language Models (LLMs) have shown strong performance in automated source-to-target code translation through pretraining on extensive code corpora. However, mainstream LLM-based code translation methods suffer from two critical limitations. First, they are highly sensitive to language-specific features, which often introduce source-language syntax or lexicon into the output, leading to syntactic confusion. Second, they lack fine-grained semantic alignment due to an over-reliance on function-level parallel datasets, resulting in semantic misalignment between the translated code and the original source. To overcome these limitations, we propose TIT, a Tree-structured Instruction Tuning paradigm for LLM-based code translation. Specifically, TIT consists of three modules. First, to mitigate syntactic confusion, the syntactic information representation module integrates language-agnostic syntactic features via structured parsing. Then, to generate high-quality fine-grained parallel data, the fine-grained parallel dataset augmentation module aligns nodes with code segments through statement-level segmentation and contrastive matching. Finally, we leverage the dual-stage tree instruction tuning module to alleviate the contextual processing burden that the introduced syntactic information places on the LLM. The first stage employs syntax-aware fine-tuning to enable the LLM to autonomously comprehend structured syntactic information, while the second stage utilizes code generation fine-tuning to guide the model in generating accurate target code based on function-level syntactic dependencies. Experimental results demonstrate that the proposed method significantly outperforms existing approaches across multiple LLMs, achieving a 1.22–1.75× higher success rate in code translation while markedly reducing syntactic confusion.
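The "statement-level segmentation" step behind the fine-grained parallel dataset can be sketched as follows: split a function body into individual statement spans, which then serve as the units to be paired across source and target languages. The sketch below uses Python's `ast` line-span metadata for illustration and omits the contrastive matching that scores candidate pairs; the segmentation scheme is our assumption, not the paper's exact procedure.

```python
import ast

def split_statements(source: str) -> list[str]:
    """Return the top-level statements of a function as text segments.

    Each segment is one candidate alignment unit for building
    statement-level parallel data; a contrastive matcher would then
    score source/target segment pairs (not implemented here).
    """
    tree = ast.parse(source)
    func = tree.body[0]  # assumes a single top-level function definition
    lines = source.splitlines()
    segments = []
    for stmt in func.body:
        # ast records 1-based, end-inclusive line spans per statement
        segments.append("\n".join(lines[stmt.lineno - 1 : stmt.end_lineno]))
    return segments

src = "def f(x):\n    y = x * 2\n    if y > 4:\n        y -= 1\n    return y\n"
# split_statements(src) yields three segments: the assignment,
# the whole if-block, and the return statement.
```

Note that a compound statement (the `if` block) is kept as one unit, so segments remain syntactically complete rather than being split at arbitrary lines.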