PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues

📅 2025-05-22

🤖 AI Summary

To address the recall-precision trade-off in PII anonymization for educational tutoring dialogues, this paper proposes a lightweight, context-aware anonymization framework that uses the question-answer structure of the dialogues as a prior. The method anchors PII identification to the question under discussion and combines rule-guided named entity recognition (NER), context-sensitive pattern matching, and a lightweight sequence labeling model. Key contributions include: (1) releasing QATD-2k, the largest publicly available real-world educational dialogue dataset for PII anonymization research; (2) achieving a 12.6% F1-score improvement over prior methods in educational dialogue contexts, with an inference speed of 320 tokens/second; and (3) enabling end-to-end, low-latency de-identification that is already deployed in multiple educational AI data governance pipelines. The framework balances accuracy, efficiency, and practical deployability without compromising anonymization robustness.
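As a rough illustration of what question anchoring could look like, the sketch below treats names that already appear in the tutoring question as problem content and leaves them intact, while masking other names and email addresses in the dialogue message. This is not the PIIvot implementation: the function `anonymize`, the `KNOWN_NAMES` lexicon (a stand-in for an NER model), the placeholder labels, and the rule that question-mentioned names are non-PII are all assumptions made for the example.

```python
import re

# Simple email pattern; a stand-in for the pattern-matching stage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Capitalized word tokens are treated as name candidates.
TOKEN_RE = re.compile(r"\b[A-Z][a-z]+\b")

# Hypothetical name lexicon standing in for an NER model.
KNOWN_NAMES = {"Alice", "Maria", "Tom"}


def anonymize(message: str, question: str) -> str:
    """Mask emails and names in `message`, except names found in `question`.

    Assumption: a name that occurs in the anchoring question belongs to the
    problem text (e.g. a word problem) and is therefore not treated as PII.
    """
    question_names = {
        token for token in TOKEN_RE.findall(question) if token in KNOWN_NAMES
    }

    # Mask emails first so their local parts are not re-scanned as names.
    masked = EMAIL_RE.sub("[EMAIL]", message)

    def replace_name(match: re.Match) -> str:
        token = match.group(0)
        if token in KNOWN_NAMES and token not in question_names:
            return "[NAME]"
        return token

    return TOKEN_RE.sub(replace_name, masked)


question = "Alice has 3 apples and gives 1 to Tom. How many are left?"
message = "Hi, I'm Maria. Does Alice have 2 apples left? Email me at maria@example.com"
print(anonymize(message, question))
```

Here "Maria" and the email address are masked, while "Alice" is kept because it is part of the question. Restricting the candidate set this way is one plausible route to higher precision without lowering recall on true PII, but the actual anchoring rules used by the paper may differ.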

📝 Abstract

Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.

Problem

Research questions and friction points this paper is trying to address.

Lightweight NLP anonymization for tutoring dialogues

Improving PII detection using data context knowledge

Addressing error thresholds in anonymization pipelines

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight NLP framework for PII anonymization

Uses data context to simplify PII detection

Contributes QATD-2k, the largest open-source real-world tutoring dataset of its kind
