🤖 AI Summary
This work addresses contextual privacy leakage in retrieval-augmented generation (RAG) systems, where clusters of attributes that are not individually regulated can collectively identify a person and slip past standard personally identifiable information (PII) filters. The authors propose T3+OCSVM, a Privacy Policy Enforcement (PPE) detector that combines fused text embeddings with one-class support vector machine (OCSVM) density estimation and a calibrated abstain region for out-of-distribution inputs. Training and validation data come from an axis-stratified, multi-LLM synthetic data pipeline spanning medicine, finance, and law. On borderline-safe stress tests, the method achieves an AUROC of 0.93 or higher and reduces false positive rates by 44–55 percentage points compared to baselines, while maintaining millisecond-level latency. The authors argue that this makes the detector better suited to deployment than supervised MLP classifiers, which abstain too often, and 14B-parameter LLM judges, which suffer from latency and calibration issues.
📝 Abstract
Standard PII filters often miss contextual data leakage in RAG systems, such as clusters of non-regulated attributes that collectively identify individuals. We introduce a Privacy Policy Enforcement (PPE) framework that uses dual one-class density estimators over fused text embeddings, with a calibrated abstain region for out-of-distribution inputs. Using an axis-stratified, multi-LLM synthetic data pipeline across medicine, finance, and law, we find that traditional Gaussian Mixture baselines fail on borderline-safe stress tests because they focus on linguistic register rather than content.
Our proposed T3+OCSVM detector, trained on safe and borderline-safe data, achieves a borderline AUROC of 0.93+ while reducing false positives by 44–55 percentage points and maintaining millisecond latency. Our framework is better suited to operational use than supervised MLP classifiers, which suffer from high abstention rates, and 14B-parameter LLM judges, which suffer from latency and calibration issues. This methodology provides a robust stress-testing standard for any synthetic-data-trained classifier.
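To make the idea of a calibrated abstain region concrete, here is a minimal sketch of how two one-class scores could be turned into an allow/block/abstain decision. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say what the two estimators are fit on or how the region is calibrated. The sketch assumes one scorer is fit on safe text and the other on policy-violating text, that a higher score means "more typical of that scorer's training data", and that each floor is set to a low percentile of held-out in-distribution scores. The names `calibrate_floor`, `decide`, and `tail_percent` are hypothetical.

```python
from statistics import quantiles


def calibrate_floor(held_out_scores, tail_percent=5):
    """Return the score below which roughly `tail_percent`% of held-out
    in-distribution scores fall (assumed calibration rule, not the paper's)."""
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = quantiles(held_out_scores, n=100, method="inclusive")
    return cuts[tail_percent - 1]


def decide(safe_score, unsafe_score, safe_floor, unsafe_floor):
    """Map two one-class scores to 'allow', 'block', or 'abstain'.

    Assumption: higher score = more typical of that estimator's training data.
    """
    looks_safe = safe_score >= safe_floor
    looks_unsafe = unsafe_score >= unsafe_floor
    if looks_safe and not looks_unsafe:
        return "allow"
    if looks_unsafe and not looks_safe:
        return "block"
    # Typical of neither estimator (out-of-distribution) or of both
    # (ambiguous): defer instead of guessing.
    return "abstain"


if __name__ == "__main__":
    # Toy held-out scores; real scores would come from the fitted estimators.
    safe_floor = calibrate_floor([float(s) for s in range(101)])
    unsafe_floor = calibrate_floor([float(s) for s in range(101)])
    print(decide(40.0, 1.0, safe_floor, unsafe_floor))
    print(decide(1.0, 1.0, safe_floor, unsafe_floor))
```

In this sketch the abstain region covers two cases: inputs that neither estimator recognizes, and inputs that both do. Raising `tail_percent` widens the region, trading more abstentions for fewer confident mistakes.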