🤖 AI Summary
Existing code generation benchmarks (e.g., HumanEval) are English-centric, limited in task diversity, and lack broad linguistic coverage, hindering accurate evaluation of large language models’ multilingual code generation capabilities.
Method: We introduce mHumanEval, an extended code generation benchmark supporting prompts in over 200 natural languages, including expert human translations for 15 diverse languages. The methodology combines established machine translation with a quality-assurance process to ensure semantic alignment, while retaining HumanEval’s test infrastructure and the standard pass@k evaluation protocol.
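Since the benchmark retains HumanEval's pass@k protocol, the metric is computed with the standard unbiased estimator from the original HumanEval paper (Chen et al., 2021); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled per problem
    c: completions that pass all unit tests
    k: evaluation budget

    pass@k = 1 - C(n - c, k) / C(n, k), i.e. the probability that at
    least one of k randomly drawn samples is correct.
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 1 correct -> pass@1 = 0.1
print(pass_at_k(10, 1, 1))
```

The per-problem scores are then averaged over the benchmark's problems (and, here, reported per natural-language prompt set).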
Contribution/Results: Experiments reveal substantial performance degradation of state-of-the-art code LLMs on low-resource language prompts, highlighting critical gaps in current multilingual generalization. mHumanEval is fully open-sourced and reproducible, establishing a new standard for rigorous, cross-lingual code generation evaluation.
📝 Abstract
Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However, this and other Code LLM benchmarks face critical limitations, particularly in task diversity, test coverage, and linguistic scope. Current evaluations primarily focus on English-to-Python conversion tasks with limited test cases, potentially overestimating model performance. While recent works have addressed test coverage and programming language (PL) diversity, code generation from low-resource language prompts remains largely unexplored. To address this gap, we introduce mHumanEval, an extended benchmark supporting prompts in over 200 natural languages. We employ established machine translation methods to compile the benchmark, coupled with a quality assurance process. Furthermore, we provide expert human translations for 15 diverse natural languages (NLs). We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs, offering insights into the current landscape of cross-lingual code generation.