CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation

📅 2025-04-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing code benchmarks predominantly target single tasks and fail to comprehensively evaluate large language models' holistic capabilities in realistic software engineering scenarios. To address this gap, we propose CoCo-Bench, the first multilingual, multi-difficulty benchmark systematically covering four core dimensions: code understanding, generation, modification, and review. Our methodology features: (1) a principled integration of essential developer competencies; (2) high-quality, human-annotated data construction; and (3) a cross-task consistent evaluation framework. Extensive experiments demonstrate that CoCo-Bench precisely identifies capability bottlenecks across state-of-the-art code LLMs, sharply discriminates model performance across diverse subtasks, and exhibits strong reliability and scalability. As a result, CoCo-Bench establishes a reproducible, extensible, and empirically grounded standard for comprehensive code model evaluation.

📝 Abstract
Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a specific task and lacking a comprehensive evaluation framework that reflects real-world applications. To address these gaps, we introduce CoCo-Bench (Comprehensive Code Benchmark), designed to evaluate LLMs across four critical dimensions: code understanding, code generation, code modification, and code review. These dimensions capture essential developer needs, ensuring a more systematic and representative evaluation. CoCo-Bench includes multiple programming languages and varying task difficulties, with rigorous manual review to ensure data quality and accuracy. Empirical results show that CoCo-Bench aligns with existing benchmarks while uncovering significant variations in model performance, effectively highlighting strengths and weaknesses. By offering a holistic and objective evaluation, CoCo-Bench provides valuable insights to guide future research and technological advancements in code-oriented LLMs, establishing a reliable benchmark for the field.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lack comprehensive evaluation for real-world code tasks
No unified framework assesses code understanding, generation, modification, and review
Current benchmarks are limited in programming languages and task diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-dimensional evaluation framework spanning code understanding, generation, modification, and review (a minimal sketch follows this list)
Covers multiple programming languages and task difficulty levels
Rigorous manual review ensures data quality and accuracy
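To make the four-dimension setup concrete, here is a minimal, hypothetical sketch of what a CoCo-Bench-style evaluation loop could look like. The task schema, field names, and the exact-match scorer are illustrative assumptions, not the paper's actual data format or metrics (which, e.g., would plausibly use unit-test-based pass@k for generation).

```python
# Hypothetical sketch of a CoCo-Bench-style multi-dimension evaluation loop.
# The Task schema, field names, and exact-match scoring below are assumptions
# for illustration; the paper's actual format and metrics are not shown here.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    dimension: str   # "understanding" | "generation" | "modification" | "review"
    language: str    # e.g. "python", "java"
    difficulty: str  # e.g. "easy", "medium", "hard"
    prompt: str
    reference: str   # gold answer, gold patch, or review label

def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 on a normalized string match, else 0.0 (a stand-in metric)."""
    return float(prediction.strip() == reference.strip())

# One scorer per dimension; a real benchmark would use unit tests, patch
# application, etc. Exact match keeps this sketch self-contained.
SCORERS: dict[str, Callable[[str, str], float]] = {
    "understanding": exact_match,
    "generation": exact_match,
    "modification": exact_match,
    "review": exact_match,
}

def evaluate(model: Callable[[str], str], tasks: list[Task]) -> dict[str, float]:
    """Return the mean score per dimension so capability gaps stay visible."""
    totals: dict[str, list[float]] = {d: [] for d in SCORERS}
    for task in tasks:
        prediction = model(task.prompt)
        totals[task.dimension].append(
            SCORERS[task.dimension](prediction, task.reference)
        )
    return {d: sum(s) / len(s) for d, s in totals.items() if s}

if __name__ == "__main__":
    demo = [Task("review", "python", "easy", "Is `x = x ++ 1` valid Python?", "no")]
    print(evaluate(lambda prompt: "no", demo))  # {'review': 1.0}
```

Reporting a score per dimension rather than pooling all tasks is what lets a benchmark of this kind surface capability bottlenecks, for example a model that is strong at generation but weak at review.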
👥 Authors
Wenjing Yin (Peking University)
Tianze Sun (Harbin Institute of Technology)
Yijiong Yu (Master Student, Tsinghua University; Natural Language Processing, Machine Learning)
Jiawei Fang (China Unicom Software Research Institute)
Guangyao Su (China Unicom Software Research Institute)
Jiancheng Wang (China Unicom Software Research Institute)
Zekun Wang (OpenCSG)
Wei Wang (OpenCSG)
Ran Chen (OpenCSG)
Ziyun Dai (OpenCSG)
Shuai Yuan (Peking University)
Menghang Dong (Peking University)
Peng Luo (MIT; Spatial Data Science, Spatial Statistics, Spatial Analysis, GeoAI, GIScience)
Dong Cao (OpenCSG)
Da Lei (The Hong Kong Polytechnic University; Traffic Big Data, Deep Learning)
Yajun Zhang (OpenCSG)
Hao Chen (OpenCSG)
Xiang Ma (Assistant Professor, University of Wisconsin-Eau Claire; Federated learning, Signal Processing, NOMA)
Yong Liu (OpenCSG)
Weifeng Liu (University of Florida; Machine Learning, Signal Processing, Kernel adaptive filtering)
Yuanjian Xu (Hong Kong University of Science and Technology (Guangzhou))
Jingfei Pei (OpenCSG)