🤖 AI Summary
To address the suboptimal performance of multilingual code generation under resource constraints, this paper proposes a dual-granularity Mixture-of-Experts (MoE) extension. At the token level, it introduces a shared expert and gate weight normalization; at the code-segment level, it designs a sliding-window segmentation scheme coupled with a top-k expert-choice routing mechanism, jointly modeling syntactic structure and contextual patterns. The approach avoids full-parameter fine-tuning, significantly reducing computational overhead while preserving strong generative capability across mainstream programming languages. Experiments show consistent gains over same-scale baseline models on multilingual code generation benchmarks, achieving a more favorable trade-off between performance and resource efficiency. This work establishes a scalable architectural paradigm for lightweight multilingual code LLMs.
📝 Abstract
Despite the excellent code generation capabilities of LLMs, multilingual code generation remains extremely challenging. To address this, we intend to improve the multi-programming-lingual (MultiPL) performance of base LLMs while retaining their performance on the most popular programming languages, using restricted computational resources. We regard MultiPL as a special case of multiple natural languages and propose a MultiPL extension of LLMs utilizing a hybrid mixture of experts (MoE), called MultiPL-MoE. Specifically, MultiPL-MoE combines two paired MoEs to optimize expert selection at both the token and segment levels. The token-level MoE is a standard upcycling MoE structure with a shared expert and a novel gate weight normalization approach that aids the final fusion with the segment-level MoE. The segment-level MoE incorporates two innovative designs to better capture the syntactic structure and contextual patterns of programming languages: first, a sliding window partitions the input token sequence into multiple segments; second, an expert-choice routing strategy allows each expert to select its top-k segments. Experimental results demonstrate the effectiveness of MultiPL-MoE.
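The token-level branch can be sketched in a few lines. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the router is a plain softmax, and "gate weight normalization" is interpreted here as renormalizing the selected experts' gate weights to sum to one so the routed output stays on a scale comparable to the shared expert's.

```python
import numpy as np

def token_moe_forward(x, expert_fns, shared_fn, gate_w, top_k=2):
    """Token-level MoE sketch: every token passes through the shared expert,
    plus its top-k routed experts with normalized gate weights.

    x:          (d,) hidden state of one token.
    expert_fns: list of callables, one per routed expert.
    shared_fn:  callable for the always-active shared expert.
    gate_w:     (num_experts, d) router matrix.

    NOTE: treating gate weight normalization as renormalizing the selected
    experts' weights to sum to 1 is our assumption, not the paper's spec.
    """
    logits = gate_w @ x                       # router scores, one per expert
    top = np.argsort(-logits)[:top_k]         # indices of the top-k experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                              # normalize over selected experts only
    routed = sum(wi * expert_fns[e](x) for wi, e in zip(w, top))
    return shared_fn(x) + routed              # shared expert is always added
```

Because the selected gate weights sum to one, the routed contribution is a convex combination of expert outputs, which keeps its magnitude stable for the later fusion with the segment-level branch.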
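The segment-level designs (sliding-window segmentation and expert-choice top-k routing) can likewise be illustrated with a short sketch. Again this is an assumption-laden toy version: the helper names, the mean-pooled segment embeddings implied by the interface, and the softmax-over-segments scoring are ours; only the overall scheme (experts choosing their top-k segments, rather than segments choosing experts) follows the abstract.

```python
import numpy as np

def sliding_window_segments(tokens, window, stride):
    """Partition a token sequence into (possibly overlapping) fixed-size segments."""
    return [tokens[i:i + window] for i in range(0, len(tokens) - window + 1, stride)]

def expert_choice_routing(segment_embeds, expert_weights, k):
    """Expert-choice routing: each expert selects its top-k segments by affinity.

    segment_embeds: (num_segments, d) pooled segment representations.
    expert_weights: (num_experts, d) one routing vector per expert.
    Returns {expert_idx: list of k segment indices it selects}.
    """
    scores = expert_weights @ segment_embeds.T            # (num_experts, num_segments)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)             # softmax over segments
    return {e: np.argsort(-probs[e])[:k].tolist() for e in range(len(expert_weights))}
```

A useful property of expert-choice routing, visible in the sketch, is that every expert processes exactly k segments, so the load across experts is balanced by construction rather than by an auxiliary loss.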