🤖 AI Summary
This work proposes ReasonVul, a novel framework for vulnerability detection that addresses the limitations of existing large language model (LLM)-based approaches, which often rely on a single reasoning paradigm and struggle with the complexity and diversity of real-world vulnerabilities. ReasonVul introduces a multi-perspective collaborative reasoning mechanism in which three LLM agents, each employing a distinct reasoning strategy, independently analyze code and then engage in a structured debate to resolve conflicts and reach consensus. This process leverages cognitive complementarity and iterative error correction to enhance detection accuracy and robustness. On PrimeVul, ReasonVul achieves a PairAcc of 40.00% (an 81.24% relative improvement over the best baseline) and an F1-score of 72.52%; on JITVUL it achieves a PairAcc of 28.67%. Of 542 conflicting cases analyzed, 389 (about 72%) were correctly resolved.
📝 Abstract
Automated vulnerability detection is crucial for enhancing software security by identifying potential flaws that attackers could exploit, thereby reducing the reliance on labor-intensive manual code audits. Recent advancements have shifted towards leveraging large language models (LLMs) for vulnerability detection, with techniques like Vul-RAG and VulnSage demonstrating progress through structured prompting and external knowledge integration. However, these approaches typically rely on a single reasoning paradigm, limiting their ability to address the complex and diverse nature of real-world vulnerabilities. To overcome these limitations, we propose ReasonVul, a novel multi-perspective reasoning framework that harnesses cognitive synergy among three specialized LLM agents, each embodying a distinct reasoning mode. The framework begins with independent analyses of the source code, followed by a structured debate mechanism that resolves conflicts through iterative rebuttal and revision, ultimately converging on a collaborative judgment. Evaluated on the PrimeVul dataset, ReasonVul achieves a PairAcc of 40.00% and an F1-score of 72.52%, a relative PairAcc improvement of 81.24% over the best baseline. Further tests on the JITVUL dataset confirm its generalizability, with a PairAcc of 28.67%. Additionally, we analyzed 542 conflict cases and found that 389 (71.8%) were correctly resolved, highlighting the framework's ability to uncover hidden vulnerabilities through the debate-driven error-correction mechanism. This work emphasizes the importance of multi-perspective reasoning and collaborative validation in achieving robust and comprehensive vulnerability detection in real-world software systems.
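
The two-phase flow described in the abstract (independent analysis, then debate with rebuttal and revision) can be illustrated with a minimal sketch. This is not the authors' implementation. The `Opinion` type, the agent interface, the `max_rounds` cap, and the majority-vote fallback when no consensus is reached are all assumptions made for illustration. The three toy agents are string heuristics standing in for LLM calls; they do not correspond to the paper's actual reasoning modes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Opinion:
    agent: str
    vulnerable: bool
    rationale: str


# Hypothetical agent interface: (code, peer opinions) -> own opinion.
# In a real system each agent would wrap an LLM call.
Agent = Callable[[str, List[Opinion]], Opinion]


def debate(code: str, agents: Dict[str, Agent], max_rounds: int = 3) -> Tuple[bool, int]:
    """Return (final verdict, number of debate rounds used)."""
    # Phase 1: every agent analyzes the code without seeing peers.
    opinions = {name: agent(code, []) for name, agent in agents.items()}
    rounds = 0
    # Phase 2: while verdicts conflict, each agent sees the others'
    # previous-round opinions and may revise its own.
    while rounds < max_rounds and len({o.vulnerable for o in opinions.values()}) > 1:
        opinions = {
            name: agent(code, [o for n, o in opinions.items() if n != name])
            for name, agent in agents.items()
        }
        rounds += 1
    # Assumed fallback: majority vote if the debate did not converge.
    votes = sum(o.vulnerable for o in opinions.values())
    return votes * 2 > len(opinions), rounds


# Toy stand-ins for LLM agents (illustrative heuristics only).
def pattern_agent(code: str, peers: List[Opinion]) -> Opinion:
    hit = "strcpy(" in code
    return Opinion("pattern", hit, "unbounded copy call" if hit else "no risky call")


def bounds_agent(code: str, peers: List[Opinion]) -> Opinion:
    hit = "strcpy(" in code and "strlen(" not in code
    return Opinion("bounds", hit, "copy without length check" if hit else "length checked or no copy")


def skeptic_agent(code: str, peers: List[Opinion]) -> Opinion:
    # Starts from "benign"; revises only when every peer reports a vulnerability.
    convinced = bool(peers) and all(p.vulnerable for p in peers)
    return Opinion("skeptic", convinced, "persuaded by peers" if convinced else "not persuaded")


AGENTS = {"pattern": pattern_agent, "bounds": bounds_agent, "skeptic": skeptic_agent}
```

For example, `debate("void f(char *s) { char b[8]; strcpy(b, s); }", AGENTS)` starts with a two-to-one split and converges after one revision round, while an input on which the toy agents never agree falls back to the majority vote once `max_rounds` is exhausted.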