🤖 AI Summary
This work presents the first systematic evaluation of large language models (LLMs) for Automated Vulnerability Localization (AVL), the task of pinpointing the specific lines of code responsible for a discovered vulnerability. Experiments are conducted on the BigVul (C/C++) dataset and a smart contract vulnerability dataset, covering more than ten code-oriented LLMs (60M to 16B parameters) spanning encoder-only, encoder-decoder, and decoder-only architectures, under zero-shot, one-shot, discriminative fine-tuning, and generative fine-tuning paradigms. Key contributions include: (1) the first empirical demonstration that discriminative fine-tuning substantially outperforms existing learning-based AVL methods, while the other paradigms are markedly less effective; (2) two remedial strategies, sliding-window context partitioning and right-forward embedding, that mitigate input-length limits and unidirectional-context constraints during fine-tuning; and (3) evidence of generalization across Common Weakness Enumerations (CWEs) and across projects, indicating a promising path toward practical use.
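To make the discriminative setup concrete, below is a minimal sketch of line-level vulnerability classification with a pretrained code encoder. The encoder choice (`microsoft/codebert-base`), the mean pooling of each line's token embeddings, and the two-way classification head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LineLocalizer(nn.Module):
    """Scores each source line of a function as vulnerable or benign.

    Illustrative sketch only; the paper's architecture may differ."""

    def __init__(self, encoder_name: str = "microsoft/codebert-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, line_ids):
        # line_ids maps every token to the 0-based index of its source
        # line (shape: batch x seq_len); padding tokens carry -1.
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                      # batch x seq_len x dim
        line_logits = []
        for line in range(int(line_ids.max()) + 1):
            mask = (line_ids == line).unsqueeze(-1).float()
            # Mean-pool the token embeddings belonging to this line.
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
            line_logits.append(self.head(pooled))
        return torch.stack(line_logits, dim=1)   # batch x num_lines x 2

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = LineLocalizer()
# Fine-tuning then minimizes per-line cross-entropy against the
# ground-truth vulnerable-line labels from the dataset.
```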
📝 Abstract
Recently, Automated Vulnerability Localization (AVL) has attracted much attention, aiming to facilitate diagnosis by pinpointing the lines of code responsible for discovered vulnerabilities. Large Language Models (LLMs) have shown potential in various domains, yet their effectiveness in vulnerability localization remains underexplored. In this work, we perform the first comprehensive study of LLMs for AVL. Our investigation encompasses 10+ leading LLMs suitable for code analysis, including ChatGPT and various open-source models, across three architectural types: encoder-only, encoder-decoder, and decoder-only, with model sizes ranging from 60M to 16B parameters. We explore the efficacy of these LLMs under four distinct paradigms: zero-shot learning, one-shot learning, discriminative fine-tuning, and generative fine-tuning. Our evaluation framework is applied to the BigVul-based dataset for C/C++ and an additional dataset comprising smart contract vulnerabilities. The results demonstrate that discriminative fine-tuning of LLMs can significantly outperform existing learning-based methods for AVL, while the other paradigms prove less effective or even unexpectedly ineffective for the task. We also identify challenges related to input length and unidirectional context in the fine-tuning of encoders and decoders, respectively. We then introduce two remedial strategies, the sliding window and the right-forward embedding, both of which substantially enhance performance. Furthermore, our findings highlight certain generalization capabilities of LLMs across Common Weakness Enumerations (CWEs) and different projects, indicating a promising pathway toward their practical application in vulnerability localization.
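As an illustration of the sliding-window remedy for input-length limits, the sketch below scores overlapping windows of a long function and keeps, for each line, the maximum vulnerability score seen across windows. The line-based windowing granularity, the window and stride sizes, and the max-merge rule are assumptions for illustration; the paper's exact scheme may differ.

```python
from typing import Callable, List

def sliding_window_scores(
    lines: List[str],
    score_window: Callable[[List[str]], List[float]],  # wraps a fine-tuned model
    window: int = 128,
    stride: int = 64,
) -> List[float]:
    """Score every line of a function that exceeds the model's context
    limit: score overlapping windows, then keep each line's maximum
    score across all windows that contain it."""
    starts = list(range(0, max(len(lines) - window, 0) + 1, stride))
    if starts[-1] + window < len(lines):      # make sure the tail is covered
        starts.append(len(lines) - window)
    scores = [0.0] * len(lines)
    for start in starts:
        chunk = lines[start:start + window]
        for offset, s in enumerate(score_window(chunk)):
            scores[start + offset] = max(scores[start + offset], s)
    return scores

# Example with a dummy scorer; in practice score_window would return the
# fine-tuned localizer's per-line vulnerability probabilities.
print(sliding_window_scores(["int x = 0;"] * 300, lambda c: [0.5] * len(c))[:3])
```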