Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation

📅 2025-09-15
🤖 AI Summary
Large language models (LLMs) exhibit inconsistent and unstable performance in source code vulnerability detection, particularly under class imbalance and in multi-class settings. Method: This paper proposes Dynamic Gated Stacking (DGS), a novel stacking ensemble framework inspired by Mixture of Experts (MoE), which adaptively fuses predictions from multiple LLMs while explicitly modeling class imbalance and multi-class characteristics. Contribution/Results: Evaluated on the Devign, ReVeal, and BigVul benchmarks, DGS significantly improves F1-score and AUC over conventional Bagging, Boosting, and standard Stacking. Empirical analysis reveals that Boosting excels in highly imbalanced scenarios, whereas DGS consistently outperforms standard Stacking across all configurations, validating its gating mechanism's efficacy in harmonizing heterogeneous model outputs. This work establishes a more robust and scalable ensemble paradigm for LLM-driven vulnerability detection.

📝 Abstract
Code vulnerability detection is crucial for ensuring the security and reliability of modern software systems. Recently, Large Language Models (LLMs) have shown promising capabilities in this domain. However, notable discrepancies in detection results often arise when analyzing identical code segments across different training stages of the same model or among architecturally distinct LLMs. While such inconsistencies may compromise detection stability, they also highlight a key opportunity: the latent complementarity among models can be harnessed through ensemble learning to create more robust vulnerability detection systems. In this study, we explore the potential of ensemble learning to enhance the performance of LLMs in source code vulnerability detection. We conduct comprehensive experiments involving five LLMs (i.e., DeepSeek-Coder-6.7B, CodeLlama-7B, CodeLlama-13B, CodeQwen1.5-7B, and StarCoder2-15B), using three ensemble strategies (i.e., Bagging, Boosting, and Stacking). These experiments are carried out across three widely adopted datasets (i.e., Devign, ReVeal, and BigVul). Inspired by Mixture of Experts (MoE) techniques, we further propose Dynamic Gated Stacking (DGS), a Stacking variant tailored for vulnerability detection. Our results demonstrate that ensemble approaches can significantly improve detection performance, with Boosting excelling in scenarios involving imbalanced datasets. Moreover, DGS consistently outperforms traditional Stacking, particularly in handling class imbalance and multi-class classification tasks. These findings offer valuable insights into building more reliable and effective LLM-based vulnerability detection systems through ensemble learning.
Problem

Research questions and friction points this paper is trying to address.

Improving code vulnerability detection stability across LLMs
Harnessing model complementarity through ensemble learning strategies
Addressing class imbalance in multi-class vulnerability classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble learning with multiple LLMs
Dynamic Gated Stacking variant
Boosting for imbalanced datasets
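
The page gives no implementation details for DGS, so as a hedged illustration of the MoE-style gating idea the summary describes, the sketch below trains a small linear gate that maps each sample's concatenated base-model class probabilities to per-model weights, then fuses the weighted predictions. The class name, the linear gate, and the squared-error training objective are assumptions made for illustration, not the authors' actual design.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class DynamicGatingEnsemble:
    """Illustrative gating-based stacking (hypothetical, not the paper's DGS):
    a linear gate maps each sample's concatenated base-model probabilities to
    per-model weights; the fused prediction is the weighted average."""

    def __init__(self, n_models, n_classes, lr=1.0, epochs=500, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.01, (n_models * n_classes, n_models))
        self.n_classes = n_classes
        self.lr, self.epochs = lr, epochs

    def _fuse(self, P):
        # P: (n_samples, n_models, n_classes) base-model probabilities
        X = P.reshape(len(P), -1)                   # gate input per sample
        G = softmax(X @ self.W, axis=1)             # (n, n_models) gate weights
        return np.einsum('nm,nmc->nc', G, P), G     # fused class probabilities

    def fit(self, P, y):
        Y = np.eye(self.n_classes)[y]               # one-hot labels
        X = P.reshape(len(P), -1)
        for _ in range(self.epochs):
            fused, G = self._fuse(P)
            err = fused - Y                                     # dL/dfused for 0.5*MSE
            dG = np.einsum('nc,nmc->nm', err, P)                # dL/dG
            dlogits = G * (dG - (G * dG).sum(1, keepdims=True)) # softmax backprop
            self.W -= self.lr * X.T @ dlogits / len(P)
        return self

    def predict(self, P):
        fused, _ = self._fuse(P)
        return fused.argmax(1)
```

Unlike uniform probability averaging, a per-sample gate of this kind can learn to downweight a base model that is systematically biased toward one class, which is one plausible reading of why gated fusion helps under class imbalance.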
Authors

Zhihong Sun
Shandong Normal University, China

Jia Li
Peking University, China

Yao Wan
Huazhong University of Science and Technology
NLP, Programming Languages, Software Engineering, Large Language Models

Chuanyi Li
National Key Laboratory for Novel Software Technology, Nanjing University, China

Hongyu Zhang
Chongqing University
Software Engineering, Mining Software Repositories, Data-driven Software Engineering, Software Analytics

Zhi Jin
Sun Yat-Sen University, Associate Professor

Ge Li
Full Professor of Computer Science, Peking University
Program Analysis, Program Generation, Deep Learning

Hong Liu
Shandong Normal University, China

Chen Lyu
Wuhan University
Natural Language Processing

Songlin Hu
Institute of Information Engineering, Chinese Academy of Sciences, China