On-Policy Optimization with Group Equivalent Preference for Multi-Programming Language Understanding

📅 2025-05-19

📈 Citations: 0

✨ Influential: 0

career value

168K/year

🤖 AI Summary

This paper addresses the uneven code understanding and generation capabilities of large language models (LLMs) across mainstream (e.g., Python, C++) versus niche programming languages (e.g., C#, Rust, Go). To bridge this gap, we propose a code-translation-driven cross-lingual capability transfer method. Our approach introduces OORL, a novel hybrid on-policy/off-policy reinforcement learning framework, and—first in the literature—Group Equivalent Preference Optimization (GEPO). GEPO leverages functionally equivalent intermediate representations (IRs) to group semantically aligned code snippets across languages, enabling joint modeling of cross-lingual semantic consistency. It integrates unit-test-driven rule-based rewards with equivalence-aware preference learning. Evaluated on multilingual code benchmarks, our method significantly improves performance on niche languages while preserving accuracy on mainstream ones. Results demonstrate that GEPO’s IR-grouped preference optimization effectively enhances cross-lingual generalization, offering a principled solution to language-imbalanced code intelligence.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) achieve remarkable performance in code generation tasks. However, a significant performance disparity persists between popular programming languages (e.g., Python, C++) and others. To address this capability gap, we leverage the code translation task to train LLMs, thereby facilitating the transfer of coding proficiency across diverse programming languages. Moreover, we introduce OORL for training, a novel reinforcement learning (RL) framework that integrates on-policy and off-policy strategies. Within OORL, on-policy RL is applied during code translation, guided by a rule-based reward signal derived from unit tests. Complementing this coarse-grained rule-based reward, we propose Group Equivalent Preference Optimization (GEPO), a novel preference optimization method. Specifically, GEPO trains the LLM using intermediate representations (IRs) groups. LLMs can be guided to discern IRs equivalent to the source code from inequivalent ones, while also utilizing signals about the mutual equivalence between IRs within the group. This process allows LLMs to capture nuanced aspects of code functionality. By employing OORL for training with code translation tasks, LLMs improve their recognition of code functionality and their understanding of the relationships between code implemented in different languages. Extensive experiments demonstrate that our OORL for LLMs training with code translation tasks achieves significant performance improvements on code benchmarks across multiple programming languages.

Problem

Research questions and friction points this paper is trying to address.

Reduce performance gap between popular and less common programming languages in LLMs

Train LLMs using code translation to transfer coding proficiency across languages

Improve code functionality recognition via Group Equivalent Preference Optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses code translation to transfer coding proficiency

Integrates on-policy and off-policy RL (OORL)

Proposes Group Equivalent Preference Optimization (GEPO)

🔎 Similar Papers

No similar papers found.