🤖 AI Summary
Code generation for low-resource programming languages such as Prolog remains challenging: existing models frequently produce syntactically invalid, logically inconsistent, or non-executable code. Method: This paper proposes a reasoning-augmented reinforcement learning (RL) framework that integrates Group Relative Policy Optimization (GRPO) with an explicit chain-of-reasoning feedback mechanism, establishing a reasoning-driven RL fine-tuning paradigm. Contribution/Results: Evaluated on mathematical logic benchmarks with the lightweight Qwen2.5-Coder model, the approach significantly improves the syntactic correctness, logical consistency, and executability of generated code, achieving higher code accuracy and better reasoning quality than strong baselines. The experimental results indicate that the method mitigates the data-scarcity bottleneck inherent in low-resource language settings and suggest that it generalizes to other under-resourced programming languages.
📝 Abstract
Generating accurate and executable code with large language models (LLMs) is challenging for languages that have little public training data compared to popular languages such as Python. This paper introduces a generalizable approach that combines small code-specialized versions of the Qwen 2.5 model with Group Relative Policy Optimization (GRPO) to enable effective code generation through explicit reasoning steps, which is particularly beneficial for languages with small public source-code corpora. Using Prolog as a representative use case -- given its limited online presence -- the initial model struggled to generate executable code. After a number of training steps, the model produces logically consistent and syntactically accurate code by integrating reasoning-driven feedback directly into the reinforcement learning loop. Experimental evaluations on mathematical logic benchmarks show significant improvements in reasoning quality, code accuracy, and logical correctness, underscoring the potential of this approach for the many programming languages that lack extensive training resources.
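To make the feedback loop concrete, the following is a minimal sketch of the group-relative advantage computation at the core of GRPO, paired with a toy reward that mirrors the signals the paper describes (syntactic validity, executability, logical correctness of generated Prolog). The `Sample` fields, reward weights, and function names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class Sample:
    """One sampled completion for a prompt, with assumed checker outcomes."""
    code: str
    parses: bool        # passed a Prolog syntax check (assumed signal)
    executes: bool      # ran without error under an interpreter (assumed signal)
    tests_passed: int   # logic test cases satisfied (assumed signal)
    tests_total: int

def reward(s: Sample) -> float:
    """Shaped reward: syntax < executability < logical correctness."""
    r = 0.0
    if s.parses:
        r += 0.2
    if s.executes:
        r += 0.3
    if s.tests_total > 0:
        r += 0.5 * s.tests_passed / s.tests_total
    return r

def grpo_advantages(group: list[Sample]) -> list[float]:
    """GRPO scores each completion relative to the other completions
    sampled for the same prompt (reward minus group mean, divided by
    group std), removing the need for a learned critic network."""
    rewards = [reward(s) for s in group]
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

In a training step, a group of completions is sampled per prompt, each is checked by the external tooling, and the resulting advantages weight the policy-gradient update, so executable, logically correct code is reinforced relative to its own group rather than against an absolute baseline.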