SIExVulTS: Sensitive Information Exposure Vulnerability Detection System using Transformer Models and Static Analysis

📅 2025-08-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Sensitive information exposure (CWE-200) remains a persistent and under-addressed vulnerability class; existing tools rarely target the diverse CWE-200 subcategories or provide context-aware analysis of code-level data flows. This paper proposes SIExVulTS, a three-stage detection system for Java applications that combines transformer-based models with static analysis: (1) an Attack Surface Detection Engine uses sentence embeddings to identify sensitive variables, strings, comments, and sinks; (2) an Exposure Analysis Engine instantiates CodeQL queries aligned with the CWE-200 hierarchy; and (3) a Flow Verification Engine uses GraphCodeBERT to semantically validate source-to-sink flows. Evaluated on three curated datasets, the Attack Surface Detection Engine achieves an average F1 score above 93%, the Exposure Analysis Engine an F1 score of 85.71%, and the Flow Verification Engine raises precision from 22.61% to 87.23%. The system also uncovered six previously unknown CVEs in major Apache projects.

📝 Abstract

Sensitive Information Exposure (SIEx) vulnerabilities (CWE-200) remain a persistent and under-addressed threat across software systems, often leading to serious security breaches. Existing detection tools rarely target the diverse subcategories of CWE-200 or provide context-aware analysis of code-level data flows. Aims: This paper aims to present SIExVulTS, a novel vulnerability detection system that integrates transformer-based models with static analysis to identify and verify sensitive information exposure in Java applications. Method: SIExVulTS employs a three-stage architecture: (1) an Attack Surface Detection Engine that uses sentence embeddings to identify sensitive variables, strings, comments, and sinks; (2) an Exposure Analysis Engine that instantiates CodeQL queries aligned with the CWE-200 hierarchy; and (3) a Flow Verification Engine that leverages GraphCodeBERT to semantically validate source-to-sink flows. We evaluate SIExVulTS using three curated datasets, including real-world CVEs, a benchmark set of synthetic CWE-200 examples, and labeled flows from 31 open-source projects. Results: The Attack Surface Detection Engine achieved an average F1 score greater than 93%, the Exposure Analysis Engine achieved an F1 score of 85.71%, and the Flow Verification Engine increased precision from 22.61% to 87.23%. Moreover, SIExVulTS successfully uncovered six previously unknown CVEs in major Apache projects. Conclusions: The results demonstrate that SIExVulTS is effective and practical for improving software security against sensitive data exposure, addressing limitations of existing tools in detecting and verifying CWE-200 vulnerabilities.
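As a rough illustration of the first stage described above (flagging identifiers that look sensitive by comparing them with known sensitive terms), the following is a minimal sketch and not the paper's implementation. It swaps in character-trigram counts for a real sentence-embedding model so that it runs with the standard library alone. The seed phrases, the threshold of 0.5, and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical seed phrases for sensitive data; not taken from the paper.
SENSITIVE_SEEDS = ["password", "secret key", "api token", "credit card number"]


def split_identifier(name: str) -> str:
    """Split camelCase and snake_case identifiers into lowercase words."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name).replace("_", " ")
    return " ".join(spaced.lower().split())


def embed(text: str) -> Counter:
    """Toy embedding: character-trigram counts, a stand-in for a sentence encoder."""
    padded = f"  {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_sensitive(identifier: str, threshold: float = 0.5) -> bool:
    """Flag an identifier whose best similarity to any seed meets the threshold."""
    vec = embed(split_identifier(identifier))
    return max(cosine(vec, embed(seed)) for seed in SENSITIVE_SEEDS) >= threshold


if __name__ == "__main__":
    for name in ["userPassword", "API_TOKEN", "loopCounter"]:
        print(name, is_sensitive(name))
```

A learned sentence encoder would capture synonyms and context that trigram overlap cannot (for example, "passphrase" versus "password"), which is presumably why the paper relies on sentence embeddings rather than string matching.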

Problem

Research questions and friction points this paper is trying to address.

Detecting sensitive information exposure vulnerabilities in Java applications

Addressing limitations of existing tools for CWE-200 subcategories

Providing context-aware analysis of code-level data flows

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer models integrated with static analysis

Three-stage architecture for vulnerability detection

GraphCodeBERT for semantic flow validation
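The role of a flow verification stage as a post-filter can be sketched as follows. This is a hedged illustration only: the `Flow` record, the `toy_verifier` rule, and the five example flows are all hypothetical, and the rule-based verifier merely stands in for the GraphCodeBERT classifier named above. The point is to show how discarding unconfirmed source-to-sink flows changes precision, not to reproduce the reported numbers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Flow:
    source: str
    sink: str
    is_true_exposure: bool  # ground-truth label, used only for evaluation


def precision(flows: List[Flow]) -> float:
    """Fraction of reported flows that are true exposures."""
    if not flows:
        return 0.0
    return sum(f.is_true_exposure for f in flows) / len(flows)


def verify_flows(flows: List[Flow], verifier: Callable[[Flow], bool]) -> List[Flow]:
    """Keep only the flows that the verifier accepts."""
    return [f for f in flows if verifier(f)]


def toy_verifier(flow: Flow) -> bool:
    """Hypothetical rule: accept flows that end in an externally visible sink."""
    return flow.sink in {"httpResponse", "logFile"}


# Made-up candidate flows, as a static-analysis stage might report them.
candidates = [
    Flow("dbPassword", "httpResponse", True),
    Flow("sessionToken", "logFile", True),
    Flow("buildVersion", "logFile", False),
    Flow("dbPassword", "encryptedStore", False),
    Flow("userName", "localVariable", False),
]

verified = verify_flows(candidates, toy_verifier)

if __name__ == "__main__":
    print(f"precision before verification: {precision(candidates):.2f}")
    print(f"precision after verification:  {precision(verified):.2f}")
```

In this toy data the filter removes two false positives and keeps one, so precision rises without reaching 100%; a real verifier can also reject true exposures, which would lower recall.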
