Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

📅 2025-11-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Code translation between low-resource programming languages (e.g., Fortran) and emerging parallel frameworks (e.g., CUDA) suffers from a scarcity of high-quality parallel corpora, which degrades LLM performance. To address this, the paper proposes a conversational, LLM-based automated data generation framework grounded in a dual-LLM question-answering mechanism. A collaborative Questioner-Solver architecture integrates compiler analysis, runtime execution feedback, and unit-test validation to generate functionally verifiable translation pairs enriched with multi-step reasoning traces. Unlike the conventional source–target code-pair paradigm, this approach substantially improves functional consistency and reliability. On C++→CUDA translation, fine-tuning a 7B open-weight model with the generated data yields a >56% improvement in unit-test pass rate, and key metrics, including compilation success rate, surpass those of larger proprietary commercial systems.
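The Questioner-Solver loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the stubbed LLM calls, and the feedback format are all assumptions; a real pipeline would invoke two LLMs plus an actual compiler and test harness.

```python
# Hypothetical sketch of a dual-LLM Questioner-Solver data-generation loop.
# All names and behaviors here are illustrative assumptions.

def questioner(source_code, feedback):
    """Poses the next translation or refinement request (stubbed LLM call)."""
    if feedback is None:
        return f"Translate to CUDA:\n{source_code}"
    return f"Previous attempt failed ({feedback}); please fix it."

def solver(prompt, attempt):
    """Produces a candidate translation (stubbed LLM call)."""
    # A real system would query an LLM; here the second attempt "succeeds".
    return "__global__ void kernel() {}" if attempt > 0 else "broken code"

def compile_and_test(candidate):
    """Stand-in for compiler analysis + runtime unit-test validation."""
    ok = candidate.startswith("__global__")
    return ok, None if ok else "compilation error"

def generate_dialogue(source_code, max_turns=3):
    """Collects a multi-turn dialogue, stopping at a verified translation."""
    dialogue, feedback = [], None
    for turn in range(max_turns):
        prompt = questioner(source_code, feedback)
        candidate = solver(prompt, turn)
        ok, feedback = compile_and_test(candidate)
        dialogue.append({"prompt": prompt, "answer": candidate, "passed": ok})
        if ok:
            break
    return dialogue

turns = generate_dialogue("void add(float* a, float* b, int n);")
print(len(turns), turns[-1]["passed"])  # → 2 True
```

The key design point the paper emphasizes is that each failed attempt, together with its compiler or runtime feedback, becomes a dialogue turn, so the final dataset records the reasoning trace rather than only the final source-target pair.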

📝 Abstract
Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran→C++ and C++→CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++→CUDA task. We show this data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.
Problem

Research questions and friction points this paper is trying to address.

LLM translation quality degrades for low-resource languages such as Fortran and emerging frameworks such as CUDA
High-quality parallel corpora for these language pairs are scarce
Conventional source-target code-pair datasets offer no functional verification and no record of the reasoning behind translation refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline with a dual-LLM Questioner-Solver design
Generates translations verified by compilation and unit tests
Produces multi-turn dialogues capturing the translation-refinement reasoning process
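The unit-test verification behind the first two contributions can be illustrated with a small sketch: a translation pair is accepted only when source and target agree on every test input. The harness below is a hypothetical stand-in; the paper validates compiled Fortran/C++/CUDA programs, not Python functions.

```python
# Hypothetical sketch of functional-consistency checking for a translation
# pair. Both "implementations" are illustrative stand-ins for the compiled
# source and target programs.

def source_impl(xs):
    """Reference behavior of the source program (sum of squares)."""
    return sum(x * x for x in xs)

def target_impl(xs):
    """Candidate translation under test."""
    total = 0
    for x in xs:
        total += x * x
    return total

def functionally_consistent(src, tgt, test_inputs):
    """Accept the pair only if outputs match on every unit test."""
    return all(src(xs) == tgt(xs) for xs in test_inputs)

tests = [[1, 2, 3], [], [-4, 5]]
print(functionally_consistent(source_impl, target_impl, tests))  # → True
```

Gating the dataset on agreement across generated unit tests is what lets the pipeline claim functional consistency for its pairs, rather than relying on surface-level similarity between source and target code.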