Secret Breach Prevention in Software Issue Reports

📅 2024-10-31
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This study addresses the previously unexplored problem of inadvertent leakage of sensitive information, such as API keys and authentication tokens, in GitHub issue reports. To overcome the lack of benchmark datasets and effective detection methods for this scenario, we introduce the first large-scale, manually annotated benchmark of 54,148 instances from public GitHub issues, including 5,881 verified real-world secrets. We propose a lightweight hybrid detection paradigm that combines regex-based extraction with large language model (LLM)-driven contextual classification, and we evaluate entropy analysis, RoBERTa/CodeBERT feature encoding, fine-tuned Qwen/LLaMA models, and GPT-4o few-shot prompting. The best approach achieves an F1 score of 94.49% on our benchmark and 81.6% F1 across 178 real-world repositories, significantly outperforming conventional entropy- and keyword-based baselines. This work establishes the first dedicated framework for detecting secret leakage in issue reports and demonstrates the strength of open-source LLM fine-tuning for this task.

📝 Abstract
In the digital era, accidental exposure of sensitive information such as API keys, tokens, and credentials is a growing security threat. While most prior work focuses on detecting secrets in source code, leakage in software issue reports remains largely unexplored. This study fills that gap through a large-scale analysis and a practical detection pipeline for exposed secrets in GitHub issues. Our pipeline combines regular expression-based extraction with large language model (LLM) based contextual classification to detect real secrets and reduce false positives. We build a benchmark of 54,148 instances from public GitHub issues, including 5,881 manually verified true secrets. Using this dataset, we evaluate entropy-based baselines and keyword heuristics used by prior secret detection tools, classical machine learning, deep learning, and LLM-based methods. Regex and entropy based approaches achieve high recall but poor precision, while smaller models such as RoBERTa and CodeBERT greatly improve performance (F1 = 92.70%). Proprietary models like GPT-4o perform moderately in few-shot settings (F1 = 80.13%), and fine-tuned larger open-source LLMs such as Qwen and LLaMA reach up to 94.49% F1. Finally, we also validate our approach on 178 real-world GitHub repositories, achieving an F1-score of 81.6%, which demonstrates our approach's strong ability to generalize to in-the-wild scenarios.
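The abstract's claim that entropy-based baselines achieve high recall but poor precision can be illustrated with a minimal sketch. This is not the paper's implementation: the Shannon-entropy scorer, the 3.0-bit threshold, and the 16-character minimum are illustrative choices, and whitespace tokenization is a deliberate simplification.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Average bits per character of the string's character distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_candidates(text: str, threshold: float = 3.0, min_len: int = 16):
    """Flag long, high-entropy tokens as potential secrets."""
    for token in text.split():
        if len(token) >= min_len and shannon_entropy(token) >= threshold:
            yield token

issue = "Set AWS_KEY=AKIAIOSFODNN7EXAMPLE before running commit 9f86d081884c7d659a2f"
print(list(entropy_candidates(issue)))
# → ['AWS_KEY=AKIAIOSFODNN7EXAMPLE', '9f86d081884c7d659a2f']
```

Note that the commit hash is flagged alongside the real key: random-looking but harmless strings (hashes, UUIDs, encoded blobs) are exactly why pure entropy scoring has high recall but low precision, which motivates the contextual classification stage.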
Problem

Research questions and friction points this paper is trying to address.

Detecting exposed secrets like API keys in GitHub issue reports
Addressing security threats from accidental sensitive information exposure
Developing detection methods to reduce false positives in secret identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining regex extraction with LLM classification for secrets
Building benchmark dataset from GitHub issue reports
Fine-tuning open-source LLMs for high detection accuracy
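The hybrid paradigm above runs in two stages: regexes extract candidate strings with surrounding context, and an LLM then classifies each candidate as a real secret or a false positive. A hedged sketch of the extraction stage only; the three patterns are illustrative examples (real secret scanners ship far larger rule sets), and the context-window size is an assumption, not the paper's configuration.

```python
import re

# Illustrative provider-style patterns; not the paper's exact rule set.
SECRET_PATTERNS = {
    "aws_access_key":  re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token":    re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "generic_api_key": re.compile(r"(?i)\bapi[_-]?key\s*[=:]\s*['\"]?([A-Za-z0-9_\-]{20,})"),
}

def extract_candidates(issue_text: str):
    """Stage 1: regex extraction, keeping context for the LLM classifier."""
    for name, pattern in SECRET_PATTERNS.items():
        for m in pattern.finditer(issue_text):
            start, end = m.span()
            yield {
                "kind": name,
                "match": m.group(0),
                # Surrounding text gives the stage-2 classifier signal to
                # separate real secrets from placeholders and examples.
                "context": issue_text[max(0, start - 40):end + 40],
            }

issue = "Crash repro: export GITHUB_TOKEN=ghp_" + "a" * 36
print([c["kind"] for c in extract_candidates(issue)])
# → ['github_token']
```

Each candidate's context window would then be sent to the classifier (RoBERTa/CodeBERT, a fine-tuned Qwen/LLaMA model, or GPT-4o few-shot in the paper's evaluation), which is what recovers the precision that regex matching alone lacks.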
Zahin Wahab
The University of British Columbia, Vancouver, BC, Canada
Sadif Ahmed
Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Md Nafiu Rahman
Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Rifat Shahriyar
Professor, Department of CSE, BUET
Memory Management · Programming Languages · Software Engineering · Natural Language Processing
Gias Uddin
Associate Professor, York University
Productivity · AI4SE · SE4AI · Testing · Security