Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks

📅 2025-11-06
🤖 AI Summary
Large language models (LLMs) exhibit persistent failure patterns in code generation, yet systematic attribution of these failures across benchmarks remains lacking. Method: We propose the first cross-benchmark weakness analysis framework, evaluating 114 high-difficulty tasks from HumanEval, MBPP, APPS, and CodeContests via integrated cross-benchmark comparison, static complexity quantification, and fine-grained human annotation. Contribution/Results: We identify four recurring LLM weaknesses: sensitivity to static structural complexity, breakdowns in multi-step logical reasoning, inadequate modeling of implicit constraints, and failure to generalize across boundary conditions. Additionally, we uncover common evaluation confounders—including redundant problem descriptions and unconventional API usage—and expose systemic biases in current benchmarks. Our findings provide reproducible empirical evidence and concrete, actionable directions for model capability diagnosis, training data curation, and evaluation methodology refinement.

📝 Abstract
Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering quantitative rankings of LLMs. However, they provide limited insight into the tasks that LLMs consistently fail to solve, information that is crucial for understanding current limitations and guiding the development of more capable models. To address this gap, we examined code generation tasks across four popular benchmarks, identifying those that major LLMs are most likely to fail. To understand the causes of these failures, we investigated whether the static complexity of solution code contributes to them, followed by a systematic inspection of 114 tasks that LLMs consistently struggled with. Our analysis revealed four recurring patterns of weakness in LLMs, as well as common complications within benchmark tasks that most often lead to failure.
Problem

Research questions and friction points this paper is trying to address.

Identifying code generation tasks where LLMs consistently fail
Analyzing static code complexity as a failure cause
Discovering recurring weakness patterns in LLM performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed code generation tasks across four benchmarks
Identified static complexity as a failure factor
Systematically inspected 114 consistently failed tasks
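The page does not specify which static complexity metrics the paper uses. As a minimal sketch of one common candidate, the snippet below approximates McCabe cyclomatic complexity of a candidate solution (1 plus the number of branch points) using Python's `ast` module; the helper name and the exact set of counted node types are illustrative assumptions, not the paper's methodology.

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe cyclomatic complexity of Python source.

    Counts 1 for the straight-line path, plus one per branching
    construct and per extra operand in and/or chains. This is a
    simplified illustration, not the paper's exact metric.
    """
    tree = ast.parse(source)
    branch_nodes = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, ast.BoolOp):
            # each additional and/or operand adds one branch
            complexity += len(node.values) - 1
        elif isinstance(node, branch_nodes):
            complexity += 1
    return complexity

solution = """
def sign(x):
    if x > 0:
        return 1
    elif x < 0:
        return -1
    return 0
"""
print(cyclomatic_complexity(solution))  # if + elif -> 1 + 2 = 3
```

A cross-benchmark analysis along these lines would compute such metrics over reference solutions and correlate them with per-task LLM failure rates.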
Amir Molzam Sharifloo
Technische Universität Darmstadt, Germany
Maedeh Heydari
Technische Universität Darmstadt, Germany
Parsa Kazerooni
Technische Universität Darmstadt, Germany
Daniel Maninger
Technische Universität Darmstadt, Germany; The Hessian Center for Artificial Intelligence (hessian.AI), Germany
Mira Mezini
Professor of Computer Science, TU Darmstadt, Germany
Research interests: Programming Languages, Software Engineering, Program Analysis, Software Security, Reactive Programming