🤖 AI Summary
Code generation for low-resource programming languages such as Prolog remains challenging: existing models frequently produce syntactically invalid, logically inconsistent, or non-executable code. Method: This paper proposes a reasoning-augmented reinforcement learning (RL) framework that integrates Group Relative Policy Optimization (GRPO) with an explicit chain-of-reasoning feedback mechanism, establishing a reasoning-driven RL fine-tuning paradigm. Contribution/Results: Evaluated on mathematical logic benchmarks with the lightweight Qwen2.5-Coder model, the approach significantly improves the syntactic correctness, logical consistency, and executability of generated code, achieving higher code accuracy and better reasoning quality than strong baselines. The experimental results indicate that the method mitigates the data-scarcity bottleneck inherent in low-resource language settings and suggest that it generalizes to other under-resourced programming languages.
📝 Abstract
Generating accurate and executable code with large language models (LLMs) is challenging for languages that have little public training data compared to popular languages such as Python. This paper introduces a generalizable approach that combines small code-specialized versions of the Qwen 2.5 model with Group Relative Policy Optimization (GRPO) to enable effective code generation through explicit reasoning steps, which is particularly beneficial for languages with small public source-code corpora. Using Prolog as a representative use case -- given its limited online presence -- the initial model struggled to generate executable code. After a number of training steps, the model produces logically consistent and syntactically accurate code by integrating reasoning-driven feedback directly into the reinforcement learning loop. Experimental evaluations on mathematical logic benchmarks show significant improvements in reasoning quality, code accuracy, and logical correctness, underscoring the potential of this approach for the many programming languages that lack extensive training resources.
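To make the feedback loop concrete, the following is a minimal sketch of the group-relative advantage computation at the core of GRPO, paired with a toy reward that mirrors the signals the paper describes (syntactic validity, executability, logical correctness of generated Prolog). The `Sample` fields, reward weights, and function names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class Sample:
    """One sampled completion for a prompt, with assumed checker outcomes."""
    code: str
    parses: bool        # passed a Prolog syntax check (assumed signal)
    executes: bool      # ran without error under an interpreter (assumed signal)
    tests_passed: int   # logic test cases satisfied (assumed signal)
    tests_total: int

def reward(s: Sample) -> float:
    """Shaped reward: syntax < executability < logical correctness."""
    r = 0.0
    if s.parses:
        r += 0.2
    if s.executes:
        r += 0.3
    if s.tests_total > 0:
        r += 0.5 * s.tests_passed / s.tests_total
    return r

def grpo_advantages(group: list[Sample]) -> list[float]:
    """GRPO scores each completion relative to the other completions
    sampled for the same prompt (reward minus group mean, divided by
    group std), removing the need for a learned critic network."""
    rewards = [reward(s) for s in group]
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

In a training step, a group of completions is sampled per prompt, each is checked by the external tooling, and the resulting advantages weight the policy-gradient update, so executable, logically correct code is reinforced relative to its own group rather than against an absolute baseline.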