LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study evaluates the capability of large language models (LLMs) to detect and repair software defects in C++ and Python code, including syntactic and semantic errors, classic security vulnerabilities, and production-grade complex bugs. Method: We systematically benchmark ChatGPT-4, Claude 3, and LLaMA 4 using a multi-stage, context-aware prompting protocol; a realistic, curated dataset integrating SEED Labs, OpenSSL, and PyBugHive; and a tiered evaluation framework measuring detection accuracy, reasoning depth, and patch correctness. A local compilation-based testing pipeline and structured prompt engineering emulate authentic debugging workflows. Contribution/Results: All models achieve high accuracy on basic errors, demonstrating utility for pedagogical support and preliminary triage. However, performance degrades significantly on complex security flaws and large-scale production code. ChatGPT-4 and Claude 3 exhibit superior contextual modeling, particularly for cross-file dependencies and subtle semantic inconsistencies, while LLaMA 4 lags in both detection robustness and repair fidelity. The study highlights critical gaps in LLM-based automated debugging and provides actionable insights for tool integration and prompt design.

📝 Abstract
Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software development, supporting tasks from code generation to debugging. Yet their real-world effectiveness in detecting diverse software bugs, particularly complex, security-relevant vulnerabilities, remains underexplored. This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of foundational programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python. The dataset integrates real code from SEED Labs, OpenSSL (via the Suresoft GLaDOS database), and PyBugHive, validated through local compilation and testing pipelines. A novel multi-stage, context-aware prompting protocol simulates realistic debugging scenarios, while a graded rubric measures detection accuracy, reasoning depth, and remediation quality. Our results show that all models excel at identifying syntactic and semantic issues in well-scoped code, making them promising for educational use and as first-pass reviewers in automated code auditing. Performance diminishes in scenarios involving complex security vulnerabilities and large-scale production code, with ChatGPT-4 and Claude 3 generally providing more nuanced contextual analyses than LLaMA 4. This highlights both the promise and the present constraints of LLMs in serving as reliable code analysis tools.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' effectiveness in detecting software bugs and vulnerabilities
Assessing performance on complex security flaws in C++ and Python
Measuring detection accuracy and repair quality in realistic debugging scenarios
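Measuring detection and repair quality together implies a tiered score rather than a binary pass/fail. The paper's exact rubric is not reproduced here; the sketch below is a hypothetical three-tier version where each tier presupposes the one before it.

```python
# Hypothetical graded rubric; the tier definitions are illustrative
# assumptions, not the paper's actual scoring scheme.
def rubric_score(detected: bool, cause_explained: bool, patch_passes: bool) -> int:
    """Map an LLM's debugging attempt to a 0-3 tier."""
    score = 0
    if detected:              # tier 1: flaw correctly located
        score = 1
        if cause_explained:   # tier 2: root cause correctly explained
            score = 2
            if patch_passes:  # tier 3: patch compiles and passes the tests
                score = 3
    return score

print(rubric_score(True, True, False))  # detection + reasoning, no working patch
```

The nesting encodes the dependency: a correct patch without a correct diagnosis, or an explanation without a detection, does not advance the tier.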
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage context-aware prompting protocol
Benchmark integrating real code from multiple sources
Graded rubric measuring detection and remediation quality
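The multi-stage, context-aware protocol can be illustrated as a conversation that advances through fixed stages, with each stage's model reply carried forward as context for the next. The stage names, message format, and `answer` callback below are assumptions for illustration; the paper's actual prompts are not reproduced here.

```python
# Hypothetical sketch of a multi-stage, context-aware prompting protocol.
STAGES = [
    ("detect",  "Identify any bugs or vulnerabilities in this {lang} code:\n{code}"),
    ("explain", "For each issue found, explain the root cause and its impact."),
    ("repair",  "Propose a minimal patch and justify why it is correct."),
]

def build_conversation(code: str, lang: str, answer) -> list:
    """Run the staged protocol, feeding each stage's answer into the next."""
    messages = []
    for name, template in STAGES:
        prompt = template.format(lang=lang, code=code)
        messages.append({"role": "user", "stage": name, "content": prompt})
        reply = answer(messages)  # in practice, a call to an LLM API
        messages.append({"role": "assistant", "stage": name, "content": reply})
    return messages

# Stub "model" so the sketch runs without an API key.
history = build_conversation("int x = arr[n];", "C++",
                             lambda msgs: f"(reply after {len(msgs)} messages)")
print([m["stage"] for m in history])
```

Keeping the full message history in each call is what makes the protocol context-aware: the repair stage sees both the detection and the explanation, so the patch can target the diagnosed root cause rather than re-deriving it.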