PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues

📅 2025-05-22
🤖 AI Summary
To address the recall-precision trade-off in PII anonymization for educational tutoring dialogues, this paper proposes a lightweight, context-aware anonymization framework leveraging question-answer structural priors. Methodologically, it introduces question-answer anchoring to simplify PII identification, integrating rule-guided named entity recognition (NER), context-sensitive pattern matching, and a lightweight sequence labeling model. Key contributions include: (1) releasing QATD-2k, the largest publicly available real-world educational dialogue dataset for PII anonymization research; (2) achieving a 12.6% F1-score improvement over prior methods in educational dialogue contexts, with inference speed of 320 tokens/second; and (3) enabling end-to-end, low-latency de-identification, already deployed in multiple educational AI data governance pipelines. The framework balances accuracy, efficiency, and practical deployability without compromising anonymization robustness.
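
As a rough illustration of how question context can simplify PII detection, the sketch below combines simple pattern matching with a question-anchored filter. It is not the paper's implementation: the regular expressions, the `anonymize` function, and the `name_candidates` argument (standing in for the output of an NER model) are all hypothetical, and the filtering rule (a name that already appears in the anchoring question is treated as problem content, not PII) is an assumption made for this example.

```python
import re

# Hypothetical patterns for illustration only; not taken from the paper.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def anonymize(message: str, question: str, name_candidates: list[str]) -> str:
    """Mask likely PII in `message`, using `question` as context.

    `name_candidates` stands in for names flagged by an NER model.
    Assumption: a candidate that also occurs in the anchoring question
    belongs to the problem text, so it is left unmasked.
    """
    question_tokens = set(re.findall(r"[A-Za-z']+", question))

    # Pattern-based PII: emails and phone numbers.
    out = EMAIL_RE.sub("[EMAIL]", message)
    out = PHONE_RE.sub("[PHONE]", out)

    # Name candidates, filtered by the question context.
    for name in name_candidates:
        if name in question_tokens:
            continue  # appears in the question, treated as non-PII
        out = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", out)
    return out


question = "Sam has 12 apples and gives 5 to his sister. How many are left?"
message = "Thanks Priya! So Sam has 7 left? Email me at kid@example.com or call 555-123-4567."
print(anonymize(message, question, ["Priya", "Sam"]))
# Thanks [NAME]! So Sam has 7 left? Email me at [EMAIL] or call [PHONE].
```

In this sketch the name "Sam" survives because it comes from the question text, while "Priya" is masked. This is the kind of false positive a context-free detector would typically produce.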

📝 Abstract
Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
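
The recall/precision trade-off mentioned above can be made concrete with a small calculation. The sketch below computes precision, recall, and F1 from detection counts; the function name `prf` and the counts are made up for illustration and are not results from the paper.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical detector: finds 90 of 100 real PII spans (10 missed)
# but also flags 30 spans that are not PII.
p, r, f1 = prf(tp=90, fp=30, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))
# 0.75 0.9 0.818
```

For anonymization, the missed spans (false negatives) are privacy leaks, while the false positives remove useful content from the dialogue, which is why both error rates matter when deciding whether a pipeline is usable.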

Problem

Research questions and friction points this paper is trying to address.

Lightweight NLP anonymization for tutoring dialogues
Improving PII detection using data context knowledge
Addressing error thresholds in anonymization pipelines

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight NLP framework for PII anonymization
Uses data context to simplify PII detection
Includes largest open-source tutoring dataset QATD-2k