🤖 AI Summary
Sensitive information exposure (CWE-200) remains a persistent and under-addressed class of vulnerability; existing tools rarely target its diverse subcategories or provide context-aware analysis of code-level data flows. This paper presents SIExVulTS, a three-stage detection system for Java applications that combines transformer-based models with static analysis: (1) an Attack Surface Detection Engine uses sentence embeddings to identify sensitive variables, strings, comments, and sinks; (2) an Exposure Analysis Engine instantiates CodeQL queries aligned with the CWE-200 hierarchy; and (3) a Flow Verification Engine uses GraphCodeBERT to semantically validate source-to-sink flows. Evaluated on three curated datasets, the Attack Surface Detection Engine achieves an average F1 score above 93%, the Exposure Analysis Engine achieves an F1 score of 85.71%, and the Flow Verification Engine raises precision from 22.61% to 87.23%. SIExVulTS also uncovered six previously unknown CVEs in major Apache projects.
📝 Abstract
Sensitive Information Exposure (SIEx) vulnerabilities (CWE-200) remain a persistent and under-addressed threat across software systems, often leading to serious security breaches. Existing detection tools rarely target the diverse subcategories of CWE-200 or provide context-aware analysis of code-level data flows.
Aims: This paper presents SIExVulTS, a novel vulnerability detection system that integrates transformer-based models with static analysis to identify and verify sensitive information exposure in Java applications.
Method: SIExVulTS employs a three-stage architecture: (1) an Attack Surface Detection Engine that uses sentence embeddings to identify sensitive variables, strings, comments, and sinks; (2) an Exposure Analysis Engine that instantiates CodeQL queries aligned with the CWE-200 hierarchy; and (3) a Flow Verification Engine that leverages GraphCodeBERT to semantically validate source-to-sink flows. We evaluate SIExVulTS on three curated datasets: real-world CVEs, a benchmark set of synthetic CWE-200 examples, and labeled flows from 31 open-source projects.
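To make the first stage concrete, the following is a minimal sketch of embedding-based sensitivity scoring, not the SIExVulTS implementation. It assumes a hypothetical seed list of sensitive concepts and a hypothetical threshold of 0.5, and it substitutes character-trigram counts for a learned sentence-embedding model so that the example runs with the standard library alone. All function names are illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical seed phrases; the vocabulary used by the paper is not given here.
SENSITIVE_SEEDS = ["password", "secret key", "api token", "credit card number"]


def split_identifier(name):
    """Turn camelCase or snake_case identifiers into lowercase words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name).replace("_", " ")
    return spaced.lower()


def embed(text):
    """Toy stand-in for a sentence embedding: character-trigram counts."""
    padded = f" {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sensitivity_score(identifier):
    """Highest similarity between the identifier and any seed phrase."""
    vec = embed(split_identifier(identifier))
    return max(cosine(vec, embed(seed)) for seed in SENSITIVE_SEEDS)


def flag_sensitive(identifiers, threshold=0.5):
    """Return identifiers whose score reaches the (assumed) threshold."""
    return [name for name in identifiers if sensitivity_score(name) >= threshold]


if __name__ == "__main__":
    print(flag_sensitive(["userPassword", "loopIndex", "apiToken"]))
```

In a real pipeline the `embed` function would be replaced by a trained sentence-embedding model, and the flagged identifiers would presumably feed the source and sink definitions of the static-analysis stage.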
Results: The Attack Surface Detection Engine achieved an average F1 score greater than 93%, the Exposure Analysis Engine achieved an F1 score of 85.71%, and the Flow Verification Engine increased precision from 22.61% to 87.23%. Moreover, SIExVulTS successfully uncovered six previously unknown CVEs in major Apache projects.
Conclusions: The results demonstrate that SIExVulTS is effective and practical for improving software security against sensitive data exposure, addressing limitations of existing tools in detecting and verifying CWE-200 vulnerabilities.