TIT: A Tree-Structured Instruction Tuning Approach for LLM-Based Code Translation

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based code translation methods suffer from two key limitations: (1) sensitivity to source-language syntax and vocabulary, leading to cross-lingual syntactic confusion; and (2) reliance on function-level parallel data, lacking statement-level semantic alignment and thus introducing semantic drift. This paper proposes TIT, a tree-structured instruction-tuning framework that innovatively integrates AST-driven syntax-agnostic parsing, statement-level fine-grained parallel data augmentation, and a two-stage instruction-tuning mechanism—thereby decoupling surface linguistic features while preserving structural consistency. Experiments demonstrate that TIT significantly improves translation accuracy across multiple mainstream LLMs, achieving 1.22–1.75× higher success rates than state-of-the-art methods and substantially reducing syntactic confusion. To our knowledge, TIT is the first approach to jointly optimize syntactic robustness and semantic fidelity in code translation.

📝 Abstract
Large Language Models (LLMs) have shown strong performance in automated source-to-target code translation through pretraining on extensive code corpora. However, mainstream LLM-based code translation methods suffer from two critical limitations. First, they are highly sensitive to language-specific features, which often introduce source-language syntax or lexicon into the output, leading to syntactic confusion. Second, they lack fine-grained semantic alignment due to an over-reliance on function-level parallel datasets, resulting in semantic misalignment between the translated code and the original source. To overcome these limitations, we propose TIT, a Tree-structured Instruction Tuning paradigm for LLM-based code translation. Specifically, TIT consists of three modules. First, to mitigate syntactic confusion, the syntactic information representation module integrates language-agnostic syntactic features via structured parsing. Then, to generate high-quality fine-grained parallel data, the fine-grained parallel dataset augmentation module aligns AST nodes with code segments through statement-level segmentation and contrastive matching. Finally, we leverage the dual-stage tree instruction tuning module to alleviate the contextual processing burden that the introduced syntactic information places on the LLM. The first stage employs syntax-aware fine-tuning to enable the LLM to autonomously comprehend structured syntactic information, while the second stage applies code generation fine-tuning to guide the model toward accurate target code based on function-level syntactic dependencies. Experimental results demonstrate that the proposed method significantly outperforms existing approaches across multiple LLMs, achieving success rates 1.22×–1.75× higher in code translation while markedly reducing syntactic confusion.
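The abstract's statement-level segmentation step (mapping AST nodes to the source code segments they cover) can be illustrated with a minimal sketch. Note this is not TIT's implementation: the paper does not specify its parser, and Python's built-in `ast` module stands in here for a structured, syntax-aware parser; the function name `statement_segments` is illustrative.

```python
import ast


def statement_segments(source: str):
    """Map each statement-level AST node to its source text segment.

    Illustrative sketch only: TIT's actual parser, alignment, and
    multi-language support are not reproduced here; Python's ast
    module stands in for a language-agnostic structured parser.
    """
    tree = ast.parse(source)
    segments = []
    for node in ast.walk(tree):
        # Keep only statement nodes (assignments, conditionals, loops, ...)
        if isinstance(node, ast.stmt):
            text = ast.get_source_segment(source, node)
            segments.append((type(node).__name__, text))
    return segments


code = "x = 1\nif x > 0:\n    y = x * 2\n"
for kind, text in statement_segments(code):
    print(kind, "->", text)
```

Pairs like these, produced for both source and target programs, are the granularity at which a statement-level parallel dataset could align nodes across languages.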
Problem

Research questions and friction points this paper is trying to address.

Addresses syntactic confusion in code translation by integrating language-agnostic features
Improves semantic alignment through fine-grained parallel dataset augmentation
Reduces contextual processing burden with dual-stage tree instruction tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-structured instruction tuning for code translation
Syntax-aware fine-tuning with structured parsing
Statement-level contrastive matching for semantic alignment
Authors
He Jiang
School of Software, Dalian University of Technology, Dalian 116024, China
Yufu Wang
University of Pennsylvania
Computer Vision, Machine Learning
Hao Lin
School of Software, Dalian University of Technology, Dalian 116024, China
Peiyu Zou
School of Computer Science and Artificial Intelligence, Liaoning Normal University, Dalian 116029, China
Zhide Zhou
School of Software, Dalian University of Technology, Dalian 116024, China
Ang Jia
Xi'an Jiaotong University
Binary Similarity
Xiaochen Li
School of Software, Dalian University of Technology, Dalian 116024, China
Zhilei Ren
Dalian University of Technology
Software Engineering, Testing, Profiling, Evolutionary Computation