🤖 AI Summary
This study addresses the previously unexplored problem of inadvertent leakage of sensitive information, such as API keys and authentication tokens, in GitHub issue reports. To overcome the lack of benchmark datasets and effective detection methods for this scenario, we introduce the first large-scale, manually annotated benchmark containing 5,881 real-world secrets. We propose a lightweight hybrid detection paradigm that combines regex-based extraction with large language model (LLM)-driven contextual classification. Our method integrates entropy analysis, RoBERTa/CodeBERT feature encoding, and fine-tuned Qwen/LLaMA models, augmented by GPT-4o few-shot learning to enhance generalization. Evaluated on our benchmark, the approach achieves an F1 score of 94.49%; it further attains 81.6% F1 across 178 real-world repositories, significantly outperforming conventional entropy- and keyword-based baselines. This work establishes the first dedicated framework for detecting secret leakage in issue reports and demonstrates the effectiveness of fine-tuning open-source LLMs for this task.
📝 Abstract
In the digital era, accidental exposure of sensitive information such as API keys, tokens, and credentials is a growing security threat. While most prior work focuses on detecting secrets in source code, leakage in software issue reports remains largely unexplored. This study fills that gap through a large-scale analysis and a practical detection pipeline for exposed secrets in GitHub issues. Our pipeline combines regular-expression-based extraction with large language model (LLM)-based contextual classification to detect real secrets and reduce false positives. We build a benchmark of 54,148 instances from public GitHub issues, including 5,881 manually verified true secrets. Using this dataset, we evaluate entropy-based baselines and keyword heuristics used by prior secret detection tools, classical machine learning, deep learning, and LLM-based methods. Regex- and entropy-based approaches achieve high recall but poor precision, while smaller models such as RoBERTa and CodeBERT greatly improve performance (F1 = 92.70%). Proprietary models like GPT-4o perform moderately in few-shot settings (F1 = 80.13%), and fine-tuned larger open-source LLMs such as Qwen and LLaMA reach up to 94.49% F1. Finally, we validate our approach on 178 real-world GitHub repositories, achieving an F1 score of 81.6%, which demonstrates strong generalization to in-the-wild scenarios.
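To make the first stage of the pipeline concrete, the sketch below shows how regex extraction can be paired with a Shannon-entropy filter. The patterns and the 3.0 bits/character threshold are illustrative assumptions, not the paper's actual configuration, and the subsequent LLM-based contextual classification stage is omitted:

```python
import math
import re

# Hypothetical patterns for common credential formats (assumption: for
# illustration only; the paper's real pattern set is not reproduced here).
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token":   re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "generic_hex":    re.compile(r"\b[0-9a-f]{32,64}\b"),
}

def shannon_entropy(s: str) -> float:
    """Shannon entropy of s, in bits per character."""
    if not s:
        return 0.0
    counts: dict[str, int] = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def extract_candidates(text: str, entropy_threshold: float = 3.0):
    """High-recall regex extraction followed by an entropy filter.

    Regexes deliberately over-match; the entropy threshold discards
    low-randomness strings (e.g. runs of a single character) that are
    unlikely to be real secrets. In the full pipeline, an LLM would then
    classify the surviving candidates in their issue-report context.
    """
    candidates = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            token = match.group(0)
            if shannon_entropy(token) >= entropy_threshold:
                candidates.append((name, token))
    return candidates
```

A high-entropy token such as AWS's documented example key `AKIAIOSFODNN7EXAMPLE` passes the filter, while a 32-character run of a single hex digit matches the `generic_hex` pattern but is dropped for having zero entropy, which is exactly the precision problem the LLM stage is meant to address further.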