AI Summary
To address the recall-precision trade-off in PII anonymization for educational tutoring dialogues, this paper proposes a lightweight, context-aware anonymization framework leveraging question-answer structural priors. Methodologically, it introduces question-answer anchoring to simplify PII identification, integrating rule-guided named entity recognition (NER), context-sensitive pattern matching, and a lightweight sequence labeling model. Key contributions include: (1) releasing QATD-2k, the largest publicly available real-world educational dialogue dataset for PII anonymization research; (2) achieving a 12.6% F1-score improvement over prior methods in educational dialogue contexts, with inference speed of 320 tokens/second; and (3) enabling end-to-end, low-latency de-identification, already deployed in multiple educational AI data governance pipelines. The framework balances accuracy, efficiency, and practical deployability without compromising anonymization robustness.
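To make the idea of question-answer anchoring concrete, here is a minimal sketch in Python. It is not the PIIvot implementation: the cue patterns, the `anonymize` function, and the placeholder labels are all hypothetical, and a simple capitalized-word regex stands in for the NER and sequence labeling components. It only illustrates how knowing what the previous question asked for can gate an aggressive detector, so that high recall is applied only where PII is expected.

```python
import re

# Hypothetical cue patterns (not from the paper): a turn matching one of
# these "anchors" the PII type expected in the following turn.
QUESTION_CUES = {
    "NAME": re.compile(r"\byour name\b", re.I),
    "SCHOOL": re.compile(r"\b(which|what) school\b", re.I),
}

# Context-free, high-precision pattern applied to every turn.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Runs of capitalized words: a crude, high-recall stand-in for an NER model.
# It is only trusted when a question anchor says PII is expected, because it
# would otherwise also hit ordinary sentence-initial words.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b")


def anonymize(dialogue):
    """Redact a dialogue given as a list of (speaker, text) tuples."""
    redacted = []
    expected = None  # PII type anchored by the previous turn, if any
    for speaker, text in dialogue:
        out = EMAIL.sub("[EMAIL]", text)
        if expected is not None:
            # Anchored answer: apply the aggressive detector.
            out = CAPITALIZED.sub(f"[{expected}]", out)
        redacted.append((speaker, out))
        # Check whether this turn anchors the next one.
        expected = next(
            (label for label, cue in QUESTION_CUES.items() if cue.search(text)),
            None,
        )
    return redacted


dialogue = [
    ("tutor", "Hi! What is your name?"),
    ("student", "my name is Alice Smith"),
    ("tutor", "Great. Solve for x in 2x = 6."),
    ("student", "Is it 3? Email me at a.b@example.com"),
]
for speaker, text in anonymize(dialogue):
    print(f"{speaker}: {text}")
```

In this toy example the capitalized-word detector fires only on the turn that follows the name question, so words such as "Great" and "Solve" in unanchored turns are left alone, while the email pattern is applied everywhere.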
Abstract
Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.