🤖 AI Summary
Evaluating the capability of large language models (LLMs) to detect and repair software defects—including syntactic/semantic errors, classic security vulnerabilities, and production-grade complex bugs—in C++ and Python code.
Method: We systematically benchmark ChatGPT-4, Claude 3, and LLaMA 4 using a multi-stage, context-aware prompting protocol; a realistic, curated dataset integrating SEED Labs, OpenSSL, and PyBugHive; and a tiered evaluation framework measuring detection accuracy, reasoning depth, and patch correctness. A local compilation-based testing pipeline and structured prompt engineering emulate authentic debugging workflows.
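The paper's validation pipeline is not specified in detail here; as a minimal sketch, the local "does it even parse/compile" gate for the Python side of the dataset could look like the following (the function name and return shape are hypothetical, not the paper's implementation; the C++ side would call a compiler such as `g++ -fsyntax-only` instead):

```python
def syntax_check(source: str) -> tuple[bool, str]:
    """Return (ok, diagnostic) for a Python snippet.

    Illustrative stand-in for a compilation-based validation step:
    uses the built-in compile() to catch syntax errors without
    executing the snippet.
    """
    try:
        compile(source, "<snippet>", "exec")
        return True, ""
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"


# A snippet with a deliberate syntax error is flagged, a valid one passes.
ok_valid, _ = syntax_check("x = 1 + 2")
ok_broken, diag = syntax_check("def f(:\n    pass")
```

Gating every benchmark item through a check like this ensures that "bugs" presented to the models are semantic or security defects in code that actually builds, rather than accidental corpus noise.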
Contribution/Results: All models achieve high accuracy on basic errors, demonstrating utility in pedagogical support and preliminary triage. However, performance degrades significantly on complex security flaws and large-scale production code. ChatGPT-4 and Claude 3 exhibit superior contextual modeling—particularly for cross-file dependencies and subtle semantic inconsistencies—while LLaMA 4 lags in both detection robustness and repair fidelity. The study highlights critical gaps in LLM-based automated debugging and provides actionable insights for tool integration and prompt design.
📝 Abstract
Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software development, supporting tasks from code generation to debugging. Yet their real-world effectiveness in detecting diverse software bugs, particularly complex, security-relevant vulnerabilities, remains underexplored. This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of foundational programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python. The dataset integrates real code from SEED Labs, OpenSSL (via the Suresoft GLaDOS database), and PyBugHive, validated through local compilation and testing pipelines. A novel multi-stage, context-aware prompting protocol simulates realistic debugging scenarios, while a graded rubric measures detection accuracy, reasoning depth, and remediation quality. Our results show that all models excel at identifying syntactic and semantic issues in well-scoped code, making them promising for educational use and as first-pass reviewers in automated code auditing. Performance diminishes, however, in scenarios involving complex security vulnerabilities and large-scale production code, with ChatGPT-4 and Claude 3 generally providing more nuanced contextual analyses than LLaMA 4. These findings highlight both the promise and the present constraints of LLMs as reliable code analysis tools.
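The multi-stage, context-aware protocol described above can be sketched as a staged conversation in which each model reply is fed back as context for the next question. The stage names, wording, and the generic `ask` callable below are illustrative assumptions, not the paper's exact protocol or any specific vendor API:

```python
from typing import Callable

def staged_debug_prompts(code: str,
                         ask: Callable[[list[dict]], str]) -> dict:
    """Run a three-stage prompting sequence: detect -> explain -> repair.

    `ask` is any chat-completion callable that takes a message list and
    returns the model's reply string. Each reply is appended to the
    running transcript so later stages see earlier context.
    """
    messages = [{"role": "system", "content": "You are a code reviewer."}]
    results = {}
    for stage, prompt in [
        ("detect", f"Does this code contain a bug?\n```\n{code}\n```"),
        ("explain", "Explain the root cause of any bug you found."),
        ("repair", "Provide a corrected version of the code."),
    ]:
        messages.append({"role": "user", "content": prompt})
        reply = ask(messages)
        messages.append({"role": "assistant", "content": reply})
        results[stage] = reply
    return results
```

Keeping the transcript cumulative is what makes the protocol "context-aware": the repair request is answered with the model's own earlier diagnosis in scope, mirroring how a developer iterates during a real debugging session.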