🤖 AI Summary
Existing LLM-based code translation methods suffer from two key limitations: (1) sensitivity to source-language syntax and vocabulary, which causes cross-lingual syntactic confusion; and (2) reliance on function-level parallel data, which lacks statement-level semantic alignment and thus introduces semantic drift. This paper proposes TIT, a tree-structured instruction-tuning framework that integrates AST-driven syntax-agnostic parsing, statement-level fine-grained parallel data augmentation, and a two-stage instruction-tuning mechanism, decoupling surface linguistic features while preserving structural consistency. Experiments demonstrate that TIT significantly improves translation accuracy across multiple mainstream LLMs, achieving 1.22–1.75× higher success rates than state-of-the-art methods and substantially reducing syntactic confusion. To our knowledge, TIT is the first approach to jointly optimize syntactic robustness and semantic fidelity in code translation.
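To make "AST-driven syntax-agnostic parsing" concrete, here is a minimal sketch of the general idea: parse a function into an AST and keep only node categories, discarding surface tokens such as keywords and operators. This uses Python's standard `ast` module for illustration; the paper's actual parser and feature format are not specified here, so the serialization scheme below is our assumption, not TIT's.

```python
import ast

def syntax_skeleton(source: str) -> list[str]:
    """Serialize code as a list of AST node-type names.

    Dropping identifiers, keywords, and operators leaves only the
    structural categories of the program, a representation that
    abstracts away language-specific surface syntax. This is an
    illustrative scheme, not necessarily the one TIT uses.
    """
    tree = ast.parse(source)
    # ast.walk yields nodes breadth-first, starting from the Module root
    return [type(node).__name__ for node in ast.walk(tree)]

skeleton = syntax_skeleton("def add(a, b):\n    return a + b\n")
# Structural categories only, e.g. 'FunctionDef', 'Return', 'BinOp';
# the names 'add', 'a', 'b' do not appear.
```

Because two functions that differ only in identifiers or keyword spelling map to the same skeleton, a representation like this is insensitive to the surface features that cause cross-lingual syntactic confusion.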
📝 Abstract
Large Language Models (LLMs) have shown strong performance in automated source-to-target code translation through pretraining on extensive code corpora. However, mainstream LLM-based code translation methods suffer from two critical limitations. First, they are highly sensitive to language-specific features, which often introduce source-language syntax or lexicon into the output, leading to syntactic confusion. Second, they lack fine-grained semantic alignment due to an over-reliance on function-level parallel datasets, resulting in semantic misalignment between the translated code and the original source. To overcome these limitations, we propose TIT, a Tree-structured Instruction Tuning paradigm for LLM-based code translation. Specifically, TIT consists of three modules. First, to mitigate syntactic confusion, the syntactic information representation module integrates language-agnostic syntactic features via structured parsing. Then, to generate high-quality fine-grained parallel data, the fine-grained parallel dataset augmentation module aligns nodes with code segments through statement-level segmentation and contrastive matching. Finally, we leverage the dual-stage tree instruction tuning module to alleviate the contextual processing burden that the introduced syntactic information places on the LLM. The first stage employs syntax-aware fine-tuning to enable the LLM to autonomously comprehend structured syntactic information, while the second stage utilizes code generation fine-tuning to guide the model in generating accurate target code based on function-level syntactic dependencies. Experimental results demonstrate that the proposed method significantly outperforms existing approaches across multiple LLMs, achieving a 1.22–1.75× higher success rate in code translation while markedly reducing syntactic confusion.
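The "statement-level segmentation" step behind the fine-grained parallel dataset can be sketched as follows: split a function body into individual statement spans, which then serve as the units to be paired across source and target languages. The sketch below uses Python's `ast` line-span metadata for illustration and omits the contrastive matching that scores candidate pairs; the segmentation scheme is our assumption, not the paper's exact procedure.

```python
import ast

def split_statements(source: str) -> list[str]:
    """Return the top-level statements of a function as text segments.

    Each segment is one candidate alignment unit for building
    statement-level parallel data; a contrastive matcher would then
    score source/target segment pairs (not implemented here).
    """
    tree = ast.parse(source)
    func = tree.body[0]  # assumes a single top-level function definition
    lines = source.splitlines()
    segments = []
    for stmt in func.body:
        # ast records 1-based, end-inclusive line spans per statement
        segments.append("\n".join(lines[stmt.lineno - 1 : stmt.end_lineno]))
    return segments

src = "def f(x):\n    y = x * 2\n    if y > 4:\n        y -= 1\n    return y\n"
# split_statements(src) yields three segments: the assignment,
# the whole if-block, and the return statement.
```

Note that a compound statement (the `if` block) is kept as one unit, so segments remain syntactically complete rather than being split at arbitrary lines.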