🤖 AI Summary
Code translation involving low-resource programming languages (e.g., Fortran) and emerging parallel frameworks (e.g., CUDA) suffers from a scarcity of high-quality parallel corpora, which degrades LLM performance on these tasks. To address this, the paper proposes a conversational, LLM-based automated data generation framework grounded in a dual-LLM question-answering mechanism. The framework employs a collaborative Questioner-Solver architecture that integrates compiler analysis, runtime execution feedback, and unit test validation to generate functionally verifiable translation pairs enriched with multi-step reasoning traces. Unlike the conventional source-target code-pair paradigm, this approach significantly improves functional consistency and reliability. Evaluated on C++→CUDA translation, fine-tuning a 7B open-weight model on the generated data yields a >56% improvement in unit test pass rate, and key metrics, including compilation success rate, surpass those of larger proprietary commercial systems.
📝 Abstract
Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran→C++ and C++→CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++→CUDA task. We show this data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.