🤖 AI Summary
Evaluating the capability of large language models (LLMs) to detect and repair software defects—including syntactic/semantic errors, classic security vulnerabilities, and production-grade complex bugs—in C++ and Python code.
Method: We systematically benchmark ChatGPT-4, Claude 3, and LLaMA 4 using a multi-stage, context-aware prompting protocol; a realistic, curated dataset integrating SEED Labs, OpenSSL, and PyBugHive; and a tiered evaluation framework measuring detection accuracy, reasoning depth, and patch correctness. A local compilation-based testing pipeline and structured prompt engineering emulate authentic debugging workflows.
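The paper's validation pipeline is not specified in detail here; as a minimal sketch, the local "does it even parse/compile" gate for the Python side of the dataset could look like the following (the function name and return shape are hypothetical, not the paper's implementation; the C++ side would call a compiler such as `g++ -fsyntax-only` instead):

```python
def syntax_check(source: str) -> tuple[bool, str]:
    """Return (ok, diagnostic) for a Python snippet.

    Illustrative stand-in for a compilation-based validation step:
    uses the built-in compile() to catch syntax errors without
    executing the snippet.
    """
    try:
        compile(source, "<snippet>", "exec")
        return True, ""
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"


# A snippet with a deliberate syntax error is flagged, a valid one passes.
ok_valid, _ = syntax_check("x = 1 + 2")
ok_broken, diag = syntax_check("def f(:\n    pass")
```

Gating every benchmark item through a check like this ensures that "bugs" presented to the models are semantic or security defects in code that actually builds, rather than accidental corpus noise.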
Contribution/Results: All models achieve high accuracy on basic errors, demonstrating utility in pedagogical support and preliminary triage. However, performance degrades significantly on complex security flaws and large-scale production code. ChatGPT-4 and Claude 3 exhibit superior contextual modeling—particularly for cross-file dependencies and subtle semantic inconsistencies—while LLaMA 4 lags in both detection robustness and repair fidelity. The study highlights critical gaps in LLM-based automated debugging and provides actionable insights for tool integration and prompt design.
📝 Abstract
Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software development, supporting tasks from code generation to debugging. Yet their real-world effectiveness in detecting diverse software bugs, particularly complex, security-relevant vulnerabilities, remains underexplored. This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of foundational programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python. The dataset integrates real code from SEED Labs, OpenSSL (via the Suresoft GLaDOS database), and PyBugHive, validated through local compilation and testing pipelines. A novel multi-stage, context-aware prompting protocol simulates realistic debugging scenarios, while a graded rubric measures detection accuracy, reasoning depth, and remediation quality. Our results show that all models excel at identifying syntactic and semantic issues in well-scoped code, making them promising for educational use and as first-pass reviewers in automated code auditing. Performance diminishes, however, in scenarios involving complex security vulnerabilities and large-scale production code, with ChatGPT-4 and Claude 3 generally providing more nuanced contextual analyses than LLaMA 4. These findings highlight both the promise and the present constraints of LLMs as reliable code analysis tools.
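The multi-stage, context-aware protocol described above can be sketched as a staged conversation in which each model reply is fed back as context for the next question. The stage names, wording, and the generic `ask` callable below are illustrative assumptions, not the paper's exact protocol or any specific vendor API:

```python
from typing import Callable

def staged_debug_prompts(code: str,
                         ask: Callable[[list[dict]], str]) -> dict:
    """Run a three-stage prompting sequence: detect -> explain -> repair.

    `ask` is any chat-completion callable that takes a message list and
    returns the model's reply string. Each reply is appended to the
    running transcript so later stages see earlier context.
    """
    messages = [{"role": "system", "content": "You are a code reviewer."}]
    results = {}
    for stage, prompt in [
        ("detect", f"Does this code contain a bug?\n```\n{code}\n```"),
        ("explain", "Explain the root cause of any bug you found."),
        ("repair", "Provide a corrected version of the code."),
    ]:
        messages.append({"role": "user", "content": prompt})
        reply = ask(messages)
        messages.append({"role": "assistant", "content": reply})
        results[stage] = reply
    return results
```

Keeping the transcript cumulative is what makes the protocol "context-aware": the repair request is answered with the model's own earlier diagnosis in scope, mirroring how a developer iterates during a real debugging session.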