🤖 AI Summary
Existing evaluations of code large language models (CodeLLMs) heavily emphasize code generation capabilities, neglecting systematic assessment of deep code understanding and reasoning. Method: We introduce CodeMMLU—the first multi-task, multilingual, multiple-choice benchmark explicitly designed for code understanding—comprising nearly 20,000 expert-validated questions spanning defect detection, execution reasoning, and code repair, grounded in program analysis, compiler theory, and software testing. We formally define and quantify code understanding ability, and construct a high-quality dataset rigorously annotated by human experts and domain specialists that supports both zero-shot and few-shot evaluation. Contribution/Results: Experiments reveal that state-of-the-art CodeLLMs achieve less than 50% average accuracy on CodeMMLU—substantially below their generation performance—exposing critical deficiencies in semantic reasoning and context-sensitive analysis. CodeMMLU thus provides a rigorous, knowledge-grounded diagnostic tool to advance the evaluation and development of true code understanding capabilities.
📝 Abstract
Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspects of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a comprehensive multiple-choice benchmark designed to evaluate the depth of software and code comprehension in LLMs. CodeMMLU includes nearly 20,000 questions spanning diverse domains such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks that emphasize code generation, CodeMMLU assesses a model's ability to reason about programs across a wide range of tasks, including code repair, execution reasoning, and fill-in-the-blank challenges. Our extensive evaluation reveals that even state-of-the-art models struggle with CodeMMLU, highlighting significant gaps in comprehension beyond generation. By emphasizing the essential connection between code understanding and effective AI-assisted development, CodeMMLU provides a critical resource for advancing more reliable and capable coding assistants.