M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation

πŸ“… 2025-12-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing code LLM evaluation benchmarks suffer from coarse-grained assessment and limited language coverage, failing to capture fine-grained cross-lingual capability disparities. To address this, we propose M2G-Eval, the first code generation evaluation framework supporting four granularities (class, function, block, and line) across 18 programming languages, comprising over 17K training tasks and 1,286 contamination-controlled, human-annotated test samples. We introduce a novel multi-granularity, multilingual co-evaluation paradigm and systematically benchmark 30 models. Our analysis reveals an ascending difficulty trend from line- to class-level generation, distinct performance patterns between full- and partial-language-support models, and measurable cross-lingual conceptual transferability; we further confirm strong cross-lingual performance correlation. Leveraging supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), our M2G-Eval-Coder models achieve significant gains over baselines across all granularities.
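The summary names GRPO but does not restate its objective. The core mechanism of Group Relative Policy Optimization is a group-normalized advantage: each sampled completion is scored relative to the other samples drawn for the same prompt, which removes the need for a separate critic model. A minimal sketch of that mechanism, assuming unit-test pass rate as the reward signal (the function name and reward values are illustrative, not the paper's implementation):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage as in GRPO: normalize each sampled
    completion's reward against the mean and std of its own sampling
    group, replacing a learned value (critic) model."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 completions sampled for one prompt, each rewarded by a
# (hypothetical) unit-test pass rate on the generated code.
print(grpo_advantages([1.0, 0.0, 0.5, 0.0]))
```

Completions that beat their group's average receive positive advantages and are reinforced; below-average ones are penalized, all without training a value network.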

πŸ“ Abstract
The rapid advancement of code large language models (LLMs) has sparked significant research interest in systematically evaluating their code generation capabilities. Yet existing benchmarks predominantly assess models at a single structural granularity and cover a limited set of programming languages, obscuring fine-grained capability variations across different code scopes and multilingual scenarios. We introduce M2G-Eval, a multi-granularity, multilingual framework for evaluating LLM code generation across four levels: Class, Function, Block, and Line. Spanning 18 programming languages, M2G-Eval includes 17K+ training tasks and 1,286 human-annotated, contamination-controlled test instances. We develop M2G-Eval-Coder models by training Qwen3-8B with supervised fine-tuning and Group Relative Policy Optimization. Evaluating 30 models (28 state-of-the-art LLMs plus our two M2G-Eval-Coder variants) reveals three main findings: (1) a clear difficulty hierarchy, with Line-level tasks easiest and Class-level tasks most challenging; (2) widening performance gaps between languages with full versus partial granularity coverage as task complexity increases; and (3) strong cross-language correlations, suggesting that models learn transferable programming concepts. M2G-Eval enables fine-grained diagnosis of code generation capabilities and highlights persistent challenges in synthesizing complex, long-form code.
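To make the multi-granularity, multilingual co-evaluation idea concrete, the sketch below shows the kind of per-cell scoring such a benchmark implies: pass@1 broken down by (granularity, language). The record schema and function names here are assumptions for illustration; the paper's actual harness and metric definitions are not specified in this summary.

```python
from collections import defaultdict

GRANULARITIES = ("class", "function", "block", "line")

def pass_at_1(results):
    """results: list of dicts like
    {"lang": "python", "granularity": "function", "passed": True}.
    Returns pass@1 per (granularity, language) cell, mirroring the
    fine-grained breakdown a benchmark like M2G-Eval reports."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        key = (r["granularity"], r["lang"])
        totals[key] += 1
        passes[key] += bool(r["passed"])
    return {k: passes[k] / totals[k] for k in totals}

scores = pass_at_1([
    {"lang": "python", "granularity": "line", "passed": True},
    {"lang": "rust", "granularity": "class", "passed": False},
])
print(scores)
```

Scoring each (granularity, language) cell separately is what surfaces the difficulty hierarchy and cross-language correlations reported in the findings, rather than a single aggregate score.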
Problem

Research questions and friction points this paper is trying to address.

How to evaluate code generation across multiple structural granularities rather than a single one
How to assess multilingual capability across 18 programming languages
How to diagnose fine-grained performance differences and whether models learn transferable programming concepts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-granularity multilingual framework for code evaluation
Training with supervised fine-tuning and Group Relative Policy Optimization
Contamination-controlled test instances across 18 programming languages (see the screening sketch below)
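The test set is described as contamination-controlled, but the procedure is not given in this summary. A common generic approach is n-gram overlap screening of test samples against the training corpus; the sketch below shows that generic technique, with the n-gram size and threshold chosen purely for illustration:

```python
def ngrams(tokens, n=8):
    """All contiguous n-grams of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_code, train_corpus, n=8, threshold=0.5):
    """Flag a test sample whose token n-gram overlap with any training
    sample exceeds the threshold. n and threshold are illustrative
    defaults, not the paper's actual settings."""
    test_grams = ngrams(test_code.split(), n)
    if not test_grams:
        return False
    for train_code in train_corpus:
        overlap = len(test_grams & ngrams(train_code.split(), n))
        if overlap / len(test_grams) >= threshold:
            return True
    return False
```

Samples flagged this way would be dropped or rewritten before release, keeping the test split disjoint from the 17K+ training tasks.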
Fanglin Xu
Beihang University
Wei Zhang
Beihang University
Jian Yang
Beihang University
Guo Chen
Hunan University
Aishan Liu
Beihang University
Zhoujun Li
Beihang University
Artificial Intelligence Β· Natural Language Processing Β· Network Security
Xianglong Liu
Beihang University
Bryan Dai
Ubiquant