🤖 AI Summary
Large language models (LLMs) exhibit persistent failure patterns in code generation, yet systematic attribution of these failures across benchmarks remains lacking.
Method: We propose the first cross-benchmark weakness analysis framework, evaluating 114 high-difficulty tasks from HumanEval, MBPP, APPS, and CodeContests via integrated cross-benchmark comparison, static complexity quantification, and fine-grained human annotation.
Contribution/Results: We identify four recurring LLM weaknesses: sensitivity to static structural complexity, breakdowns in multi-step logical reasoning, inadequate modeling of implicit constraints, and failure to generalize across boundary conditions. Additionally, we uncover common evaluation confounders—including redundant problem descriptions and unconventional API usage—and expose systemic biases in current benchmarks. Our findings provide reproducible empirical evidence and concrete, actionable directions for model capability diagnosis, training data curation, and evaluation methodology refinement.
📝 Abstract
Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering quantitative rankings of LLMs. However, they provide limited insight into the tasks that LLMs consistently fail to solve, information that is crucial for understanding current limitations and guiding the development of more capable models. To address this gap, we examined code generation tasks across four popular benchmarks and identified those that major LLMs are most likely to fail. To understand the causes of these failures, we first investigated whether the static complexity of solution code contributes to them, then systematically inspected 114 tasks that LLMs consistently struggled with. Our analysis revealed four recurring patterns of weakness in LLMs, as well as common complications within benchmark tasks that most often lead to failure.
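The abstract does not specify which static complexity measures were used. As an illustration only, here is a minimal sketch of the kind of static structural complexity quantification the paper refers to, computed over a solution's abstract syntax tree with Python's standard `ast` module (the function name and the choice of metrics are our own, not the paper's):

```python
import ast

def static_complexity(source: str) -> dict:
    """Crude static-complexity metrics for a Python solution:
    total AST node count, number of branching constructs, and
    maximum AST nesting depth. Illustrative only."""
    tree = ast.parse(source)
    branch_types = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)

    node_count = sum(1 for _ in ast.walk(tree))
    branches = sum(isinstance(n, branch_types) for n in ast.walk(tree))

    def depth(node: ast.AST, d: int = 0) -> int:
        children = list(ast.iter_child_nodes(node))
        return d if not children else max(depth(c, d + 1) for c in children)

    return {"nodes": node_count, "branches": branches, "depth": depth(tree)}

# Example: a small solution with one loop and one conditional.
code = (
    "def positive_sum(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        if x > 0:\n"
    "            total += x\n"
    "    return total\n"
)
print(static_complexity(code))
```

Metrics like these can be correlated with per-task failure rates to test whether structurally more complex reference solutions predict LLM failure, which is the kind of analysis the paper describes.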