🤖 AI Summary
Existing evaluations of code large language models (CodeLLMs) heavily emphasize code generation capabilities, neglecting systematic assessment of deep code understanding and reasoning. Method: We introduce CodeMMLU—the first multi-task, multilingual, multiple-choice benchmark explicitly designed for code understanding—comprising nearly 20,000 expert-validated questions spanning defect detection, execution reasoning, and code repair, grounded in program analysis, compiler theory, and software testing. We formally define and quantify code understanding ability, and construct a high-quality dataset rigorously annotated by human experts and domain specialists that supports both zero-shot and few-shot evaluation. Contribution/Results: Experiments reveal that state-of-the-art CodeLLMs achieve less than 50% average accuracy on CodeMMLU—substantially below their generation performance—exposing critical deficiencies in semantic reasoning and context-sensitive analysis. CodeMMLU thus provides a rigorous, knowledge-grounded diagnostic tool to advance the evaluation and development of true code understanding capabilities.
📝 Abstract
Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspects of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a comprehensive multiple-choice benchmark designed to evaluate the depth of software and code comprehension in LLMs. CodeMMLU includes nearly 20,000 questions spanning diverse domains such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks that emphasize code generation, CodeMMLU assesses a model's ability to reason about programs across a wide range of tasks, including code repair, execution reasoning, and fill-in-the-blank challenges. Our extensive evaluation reveals that even state-of-the-art models struggle with CodeMMLU, highlighting significant gaps in comprehension beyond generation. By emphasizing the essential connection between code understanding and effective AI-assisted development, CodeMMLU provides a critical resource for advancing more reliable and capable coding assistants.