CAPID: Context-Aware PII Detection for Question-Answering Systems

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing personally identifiable information (PII) detection methods often remove all PII without considering whether it is contextually relevant to the user's question, which degrades response quality in question-answering scenarios. To address this, the authors propose CAPID, a privacy-preserving framework that jointly performs fine-grained PII type identification and context-aware relevance assessment. The approach uses synthetic data generation to compensate for the lack of suitable training data, then fine-tunes a compact, locally hosted language model to detect PII spans, classify their types, and estimate their contextual relevance before the query is passed to a large language model. The reported experiments show that the method substantially outperforms existing baselines in span detection, type classification, and relevance judgment, while preserving significantly higher downstream question-answering utility under anonymization.

📝 Abstract
Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering systems. Current approaches mainly redact all PII, disregarding the fact that some of it may be contextually relevant to the user's question, which degrades response quality. Large language models (LLMs) might be able to help determine which PII is relevant, but due to their closed-source nature and lack of privacy guarantees, they are unsuitable for sensitive data processing. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) to filter sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic data generation pipeline that leverages LLMs to produce a diverse, domain-rich dataset spanning multiple PII types and relevance levels. Using this dataset, we fine-tune an SLM to detect PII spans, classify their types, and estimate contextual relevance. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance, and type accuracy while preserving significantly higher downstream utility under anonymization.
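The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes the fine-tuned SLM has already produced a list of spans, each with a type and a relevance level; the `PIISpan` structure, the `"high"`/`"low"` levels, the type labels, and the `[TYPE]` placeholder format are all hypothetical choices made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PIISpan:
    start: int      # inclusive character offset into the query
    end: int        # exclusive character offset into the query
    pii_type: str   # hypothetical label, e.g. "NAME" or "LOCATION"
    relevance: str  # hypothetical level, e.g. "high" or "low"


def span_for(query: str, text: str, pii_type: str, relevance: str) -> PIISpan:
    """Build a span for the first occurrence of `text` (stand-in for SLM output)."""
    start = query.index(text)
    return PIISpan(start, start + len(text), pii_type, relevance)


def redact_irrelevant(query: str, spans: Iterable[PIISpan],
                      keep: frozenset = frozenset({"high"})) -> str:
    """Replace every span whose relevance is not in `keep` with a type placeholder."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(query[cursor:span.start])  # untouched text before the span
        original = query[span.start:span.end]
        pieces.append(original if span.relevance in keep else f"[{span.pii_type}]")
        cursor = span.end
    pieces.append(query[cursor:])  # untouched text after the last span
    return "".join(pieces)


query = "I'm Alice and I live in Toronto. What is the weather like there in winter?"
spans = [
    span_for(query, "Alice", "NAME", "low"),         # not needed to answer
    span_for(query, "Toronto", "LOCATION", "high"),  # needed to answer
]
print(redact_irrelevant(query, spans))
# I'm [NAME] and I live in Toronto. What is the weather like there in winter?
```

In this example the name is redacted because it does not affect the answer, while the city is kept because the question depends on it. Redacting both, as a relevance-blind approach would, leaves the downstream LLM unable to answer the weather question.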
Problem

Research questions and friction points this paper is trying to address.

PII detection
context-aware
question-answering systems
privacy preservation
contextual relevance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-Aware PII Detection
Small Language Model
Synthetic Data Generation
Privacy-Preserving QA
Relevance-Aware Anonymization
Authors
Mariia Ponomarenko
University of Waterloo, Vector Institute
Sepideh Abedini
University of Waterloo, Vector Institute
Masoumeh Shafieinejad
Researcher at Vector Institute
Security & Privacy - Machine Learning and Data Analysis
D. B. Emerson
Vector Institute
Shubhankar Mohapatra
University of Waterloo, Vector Institute
Xi He
University of Waterloo
Privacy, Security, Database, Location data