Towards Contextual Sensitive Data Detection

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing sensitive data detection methods predominantly focus on personally identifiable information (PII), overlooking the contextual dependency of data sensitivity. This paper proposes a context-aware paradigm for sensitive data detection, introducing, for the first time, type contextualization and domain contextualization mechanisms. The approach integrates semantic type identification, document-level contextual modeling, sensitivity rule retrieval, and large language model (LLM)-driven reasoning into an end-to-end detection framework. Evaluated on non-standard data domains such as humanitarian datasets, the method achieves 94% recall in type contextualization, outperforming commercial tools by 31 percentage points. Domain contextualization significantly enhances adaptability to complex, real-world scenarios. Furthermore, LLM-generated interpretive explanations substantially improve inter-annotator agreement during manual review. Collectively, this work advances sensitive data detection by grounding sensitivity assessment in rich, multi-granular contextual signals rather than static, syntax-driven patterns.
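The type-contextualization idea in the summary (detect a value's semantic type, then use surrounding context to decide whether it is actually sensitive) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the regex patterns and the generic-address rule are assumptions chosen for demonstration.

```python
import re
from typing import Optional

# Stage 1: syntax-level semantic type detection (a stand-in for a real
# semantic type classifier).
TYPE_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "phone": re.compile(r"^\+?\d[\d\s()-]{7,}$"),
}

def detect_type(value: str) -> Optional[str]:
    """Return the first matching semantic type, or None."""
    for type_name, pattern in TYPE_PATTERNS.items():
        if pattern.match(value.strip()):
            return type_name
    return None

# Stage 2: contextualize the detected type. A column of organizational
# addresses like "info@example.org" is not personal data; this
# (hypothetical) rule suppresses such type-based false positives.
GENERIC_LOCAL_PARTS = {"info", "support", "contact", "admin"}

def is_sensitive(value: str) -> bool:
    detected = detect_type(value)
    if detected is None:
        return False
    if detected == "email":
        local_part = value.split("@")[0].lower()
        if local_part in GENERIC_LOCAL_PARTS:
            return False  # organizational address, not PII
    return True
```

A purely type-based detector would flag both addresses below; the contextual step keeps only the personal one.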

📝 Abstract
The emergence of open data portals necessitates more attention to protecting sensitive data before datasets are published and exchanged. While an abundance of methods for suppressing sensitive data exists, the conceptualization of sensitive data and the methods to detect it focus particularly on personal data that, if disclosed, may be harmful or violate privacy. We observe the need for refining and broadening our definitions of sensitive data, and argue that the sensitivity of data depends on its context. Based on this definition, we introduce two mechanisms for contextual sensitive data detection that consider the broader context of a dataset at hand. First, we introduce type contextualization, which first detects the semantic type of particular data values, then considers the overall context of the data values within the dataset or document. Second, we introduce domain contextualization, which determines the sensitivity of a given dataset in the broader context based on the retrieval of relevant rules from documents that specify data sensitivity (e.g., data topic and geographic origin). Experiments with these mechanisms, assisted by large language models (LLMs), confirm that: 1) type contextualization significantly reduces the number of false positives for type-based sensitive data detection and reaches a recall of 94% compared to 63% with commercial tools, and 2) domain contextualization leveraging sensitivity rule retrieval is effective for context-grounded sensitive data detection in non-standard data domains such as humanitarian datasets. Evaluation with humanitarian data experts also reveals that context-grounded LLM explanations provide useful guidance in manual data auditing processes, improving consistency. We open-source mechanisms and annotated datasets for contextual sensitive data detection at https://github.com/trl-lab/sensitive-data-detection.
Problem

Research questions and friction points this paper is trying to address.

Detecting sensitive data in open datasets using contextual awareness
Reducing false positives in sensitive data detection through semantic analysis
Applying domain-specific rules for sensitivity detection in non-standard datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Type contextualization detects semantic types and dataset context
Domain contextualization retrieves sensitivity rules from relevant documents
Large language models assist both mechanisms to improve detection accuracy
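The domain-contextualization mechanism described above retrieves sensitivity rules relevant to a dataset's metadata (e.g., topic and geographic origin) before handing them to an LLM for reasoning. The retrieval step can be sketched with a naive token-overlap ranker; the rules and scoring function below are illustrative assumptions, not the paper's retriever.

```python
# Hypothetical rule corpus, e.g., extracted from humanitarian data
# sensitivity guidelines.
RULES = [
    "Locations of aid distribution points must not be published for conflict zones.",
    "Names and contact details of beneficiaries are sensitive in all domains.",
    "Aggregate statistics at the national level are generally safe to release.",
]

def score(rule: str, metadata: str) -> int:
    """Naive relevance score: number of shared lowercase word tokens."""
    return len(set(rule.lower().split()) & set(metadata.lower().split()))

def retrieve_rules(metadata: str, top_k: int = 2) -> list:
    """Return the top_k rules most relevant to the dataset metadata,
    which would then be inserted into the LLM's reasoning prompt."""
    ranked = sorted(RULES, key=lambda r: score(r, metadata), reverse=True)
    return ranked[:top_k]
```

For a dataset described as "aid distribution locations in conflict zones", the conflict-zone location rule ranks first and grounds the downstream sensitivity judgment; a production system would likely replace the token overlap with embedding-based retrieval.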