🤖 AI Summary
This work addresses the problem that reinforcement learning (RL) for code generation often fails to generalize zero-shot to unseen programming languages and sometimes even degrades performance on them. To mitigate this, the authors propose Parallel-SFT, a method that incorporates functionally equivalent multilingual parallel programs during supervised fine-tuning (SFT) to produce a more generalizable initialization for subsequent RL training. This approach represents the first integration of parallel programs into the SFT phase and improves zero-shot transfer performance on target languages. The study further suggests that functional alignment in the latent representation space contributes to cross-lingual generalization: models initialized with Parallel-SFT not only achieve better performance on unseen languages after RL but also exhibit tighter clustering of functionally equivalent programs across languages in their internal representations.
📝 Abstract
Modern language models demonstrate impressive coding capabilities in common programming languages (PLs), such as C++ and Python, but their performance in lower-resource PLs is often limited by training data availability. In principle, however, most programming skills are universal across PLs, so the capability acquired in one PL should transfer to others. In this work, we propose the task of zero-shot cross-programming-language transfer for code RL. We find that, for Llama-3.1, RL training for code generation in a source PL fails to improve, and sometimes even degrades, performance on other target PLs. To address this, we hypothesize that effective RL transfer requires a generalizable SFT initialization before RL. We thus propose **Parallel-SFT**, an SFT strategy that incorporates "parallel programs" -- functionally equivalent code implemented in multiple PLs -- into the data mixture. We demonstrate that this improves transferability: when we subsequently perform RL on our Parallel-SFT model, we observe better generalization to unseen PLs. Analysis of the model's internal representations reveals that Parallel-SFT leads to a more functionality-centric latent space, where equivalent programs across PLs are more tightly clustered, which we hypothesize contributes to the improved transferability.
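As a rough illustration of what "more tightly clustered" could mean in practice, the sketch below compares the average cosine similarity between embeddings of the same program written in different PLs against the average similarity between embeddings of different programs. This is not the paper's analysis procedure, which the abstract does not specify; the function names `cosine` and `clustering_gap`, the dictionary layout, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clustering_gap(embeddings):
    """Compare similarity of equivalent vs. non-equivalent programs.

    `embeddings` maps (problem_id, language) -> vector (an assumed layout).
    Returns a pair:
      - mean similarity of same-problem, different-language pairs
      - mean similarity of different-problem pairs
    """
    same, diff = [], []
    for (key1, vec1), (key2, vec2) in combinations(embeddings.items(), 2):
        (prob1, lang1), (prob2, lang2) = key1, key2
        sim = cosine(vec1, vec2)
        if prob1 == prob2 and lang1 != lang2:
            same.append(sim)   # parallel programs: same function, different PL
        elif prob1 != prob2:
            diff.append(sim)   # unrelated programs, any PL combination

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(same), mean(diff)


# Toy, made-up 2-D "embeddings"; real ones would come from model hidden states.
toy = {
    ("sum_list", "python"): [1.0, 0.1],
    ("sum_list", "cpp"): [0.9, 0.2],
    ("reverse_str", "python"): [0.1, 1.0],
    ("reverse_str", "cpp"): [0.2, 0.9],
}
same_sim, diff_sim = clustering_gap(toy)
print(f"same-function similarity: {same_sim:.3f}")
print(f"different-function similarity: {diff_sim:.3f}")
```

Under this assumed metric, a more functionality-centric latent space would show a larger gap between the first and second value. Which layer and pooling strategy produce the vectors is left open here, since the abstract does not say.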