Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks

📅 2025-11-06
🤖 AI Summary
Large language models (LLMs) exhibit persistent failure patterns in code generation, yet systematic attribution of these failures across benchmarks remains lacking. Method: We propose the first cross-benchmark weakness analysis framework, evaluating 114 high-difficulty tasks from HumanEval, MBPP, APPS, and CodeContests via integrated cross-benchmark comparison, static complexity quantification, and fine-grained human annotation. Contribution/Results: We identify four recurring LLM weaknesses: sensitivity to static structural complexity, breakdowns in multi-step logical reasoning, inadequate modeling of implicit constraints, and failure to generalize across boundary conditions. Additionally, we uncover common evaluation confounders—including redundant problem descriptions and unconventional API usage—and expose systemic biases in current benchmarks. Our findings provide reproducible empirical evidence and concrete, actionable directions for model capability diagnosis, training data curation, and evaluation methodology refinement.

📝 Abstract
Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering quantitative rankings of LLMs. However, they provide limited insight into the tasks that LLMs consistently fail to solve, information that is crucial for understanding current limitations and guiding the development of more capable models. To address this gap, we examined code generation tasks across four popular benchmarks, identifying those that major LLMs are most likely to fail. To understand the causes of these failures, we investigated whether the static complexity of solution code contributes to them, followed by a systematic inspection of 114 tasks that LLMs consistently struggled with. Our analysis revealed four recurring patterns of weakness in LLMs, as well as common complications within benchmark tasks that most often lead to failure.
Problem

Research questions and friction points this paper is trying to address.

Identifying code generation tasks where LLMs consistently fail
Analyzing static code complexity as a failure cause
Discovering recurring weakness patterns in LLM performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed code generation tasks across four benchmarks
Identified static complexity as a failure factor
Systematically inspected 114 consistently failed tasks
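The page does not specify which static complexity metrics the paper uses. As a minimal sketch of one common candidate, the snippet below approximates McCabe cyclomatic complexity of a candidate solution (1 plus the number of branch points) using Python's `ast` module; the helper name and the exact set of counted node types are illustrative assumptions, not the paper's methodology.

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe cyclomatic complexity of Python source.

    Counts 1 for the straight-line path, plus one per branching
    construct and per extra operand in and/or chains. This is a
    simplified illustration, not the paper's exact metric.
    """
    tree = ast.parse(source)
    branch_nodes = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, ast.BoolOp):
            # each additional and/or operand adds one branch
            complexity += len(node.values) - 1
        elif isinstance(node, branch_nodes):
            complexity += 1
    return complexity

solution = """
def sign(x):
    if x > 0:
        return 1
    elif x < 0:
        return -1
    return 0
"""
print(cyclomatic_complexity(solution))  # if + elif -> 1 + 2 = 3
```

A cross-benchmark analysis along these lines would compute such metrics over reference solutions and correlate them with per-task LLM failure rates.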
Amir Molzam Sharifloo
Technische Universität Darmstadt, Germany
Maedeh Heydari
Technische Universität Darmstadt, Germany
Parsa Kazerooni
Technische Universität Darmstadt, Germany
Daniel Maninger
Technische Universität Darmstadt, Germany; The Hessian Center for Artificial Intelligence (hessian.AI), Germany
Mira Mezini
Professor of Computer Science, TU Darmstadt, Germany
Research interests: Programming Languages, Software Engineering, Program Analysis, Software Security, Reactive Programming