🤖 AI Summary
Language models face challenges in long-code vulnerability detection, including sparse vulnerability signals, imprecise localization, and limited contextual capacity. Existing commit-level annotations—though accurate—are unavailable during inference and cannot directly guide line-level predictions. This paper proposes FocusVul, a model-agnostic framework comprising three stages: (1) learning commit-level annotation patterns to automatically identify sensitive code regions; (2) integrating program dependence graphs (PDGs) with dynamic execution traces for hierarchical, semantics-driven, adaptive context extraction; and (3) lightweight knowledge distillation coupled with language-model-guided encoding to dynamically focus on vulnerability-relevant regions at inference time—without requiring manual annotations. FocusVul is the first approach to generalize from commit-level supervision to line-level inference. On real-world benchmarks, it improves classification performance by 164.04% and reduces computational overhead by 19.12%, significantly outperforming heuristic-based and full-function fine-tuning baselines.
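To make the pipeline concrete, here is a minimal sketch of the core idea of budgeted, dependency-guided context selection. This is not the paper's implementation: the function name `select_context`, the whitespace token cost, and the greedy expansion strategy are all illustrative assumptions. It assumes per-line relevance scores (as a model trained on commit annotations might produce) and PDG-like dependency edges, then grows a context window around the highest-scoring line until a token budget is exhausted.

```python
# Hypothetical FocusVul-style context selection (illustrative only):
# seed with the most vulnerability-relevant line, then greedily expand
# along dependency edges while a token budget permits.

def select_context(lines, scores, deps, budget):
    """Return a sorted list of selected line indices.

    lines  : list[str]           -- source lines of the function
    scores : list[float]         -- per-line relevance scores (assumed given)
    deps   : dict[int, set[int]] -- dependency neighbors (PDG-like edges)
    budget : int                 -- max total tokens (whitespace-split here)
    """
    def cost(i):
        # Crude token count; a real system would use the LM's tokenizer.
        return len(lines[i].split())

    # Seed with the single most relevant line.
    seed = max(range(len(lines)), key=lambda i: scores[i])
    selected = {seed}
    frontier = set(deps.get(seed, ()))
    used = cost(seed)

    # Greedily add the most relevant dependency neighbor that still fits.
    while frontier:
        cand = max(frontier, key=lambda i: scores[i])
        frontier.discard(cand)
        if used + cost(cand) <= budget:
            selected.add(cand)
            used += cost(cand)
            frontier |= deps.get(cand, set()) - selected
    return sorted(selected)
```

For example, with a seeded high score on a buffer-write line and edges linking it to the allocation and loop header, the selection keeps those dependency-connected lines and drops budget-exceeding ones:

```python
lines = ["buf = alloc(n)", "i = 0", "while i <= n:", "buf[i] = read()", "i += 1"]
scores = [0.2, 0.1, 0.4, 0.9, 0.1]
deps = {3: {0, 2}, 2: {1, 4}}
select_context(lines, scores, deps, budget=12)  # → [0, 2, 3]
```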
📝 Abstract
Language models (LMs) show promise for vulnerability detection but struggle with long, real-world code due to sparse and uncertain vulnerability locations. These issues, exacerbated by token limits, often cause models to miss vulnerability-related signals, thereby impairing effective learning. A key intuition is to enhance LMs with concise, information-rich context. Commit-based annotations offer precise, CWE-agnostic supervision, but are unavailable during inference, as they depend on historical code changes. Moreover, their extreme sparsity, often covering only a few lines, makes it difficult for LMs to process directly. In this paper, we propose FocusVul, a model-agnostic framework that improves LM-based vulnerability detection by learning to select sensitive context. FocusVul learns commit-based annotation patterns through hierarchical semantic modeling and generalizes them to identify line-level vulnerability-relevant regions during inference. It then extracts LM-oriented context via both dependency and execution flows surrounding selected regions, yielding semantically rich inputs for effective vulnerability detection. Experiments on real-world benchmarks show that FocusVul consistently outperforms heuristic-based and full-function fine-tuning approaches, improving classification performance by 164.04% and reducing FLOPs by 19.12% on average.