Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses over-redaction in mathematics tutoring dialogues, where general-purpose PII detection systems frequently misclassify numeric expressions as personally identifiable information and thereby reduce the utility of educational data. The authors investigate this “numeric ambiguity” problem and present MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, created through a human-in-the-loop LLM workflow that audits upstream redactions and generates privacy-preserving surrogates. Using density-based segmentation, they show that false PII redactions are disproportionately concentrated in math-dense regions. They then compare a Presidio baseline with three LLM prompting strategies: basic, math-aware, and segment-aware. Math-aware prompting achieves an F1 score of 0.821, compared with 0.379 for the baseline, while reducing numeric false positives. These findings indicate that de-identification must incorporate domain context to balance privacy preservation with pedagogical utility.
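For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1 score can be computed for PII detection. It assumes exact-match scoring over `(start, end, label)` spans; the paper's actual matching criterion is not stated here, and the function name and sample spans are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted PII spans against gold spans using exact matching.

    Assumption: a prediction counts as correct only if its
    (start, end, label) tuple matches a gold span exactly.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: two gold spans; the detector finds one of them
# and also flags two numeric expressions that are not PII.
gold_spans = {(0, 5, "PERSON"), (20, 30, "DATE")}
predicted_spans = {(0, 5, "PERSON"), (40, 44, "DATE"), (50, 53, "ID")}

precision, recall, f1 = precision_recall_f1(predicted_spans, gold_spans)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case the two numeric false positives pull precision down to one third, which is the kind of effect over-redaction has on the overall score.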

📝 Abstract
Large-scale sharing of dialogue-based data is instrumental for advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce dataset utility. This work asks how PII can be detected in math tutoring transcripts while preserving their educational utility. To address this challenge, we investigate the "numeric ambiguity" problem and introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, created through a human-in-the-loop LLM workflow that audits upstream redactions and generates privacy-preserving surrogates. The dataset contains 1,000 tutoring sessions (115,620 messages; 769,628 tokens) with validated PII annotations. Using a density-based segmentation method, we show that false PII redactions are disproportionately concentrated in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and LLM-based approaches with basic, math-aware, and segment-aware prompting. Math-aware prompting substantially improves performance over the baseline (F1: 0.821 vs. 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides both a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.
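The density-based segmentation idea can be illustrated with a small sketch. The abstract does not specify how math density is measured, so everything below is an assumption made for illustration: the token heuristic (`MATH_TOKEN`), the 0.3 threshold, and the function names are all hypothetical and are not the authors' method.

```python
import re
from itertools import groupby

# Hypothetical heuristic: a token is "math" if it contains a digit
# or consists only of common operator characters.
MATH_TOKEN = re.compile(r"\d|^[+\-*/=^<>()]+$")


def math_density(message: str) -> float:
    """Fraction of whitespace-separated tokens that look mathematical."""
    tokens = message.split()
    if not tokens:
        return 0.0
    math_tokens = sum(1 for token in tokens if MATH_TOKEN.search(token))
    return math_tokens / len(tokens)


def segment_by_density(messages, threshold=0.3):
    """Group consecutive messages into math-dense and math-sparse segments.

    Returns (label, first_index, last_index) tuples. The threshold is an
    assumed value, not one taken from the paper.
    """
    labels = [
        "math-dense" if math_density(m) >= threshold else "math-sparse"
        for m in messages
    ]
    segments = []
    index = 0
    for label, group in groupby(labels):
        length = len(list(group))
        segments.append((label, index, index + length - 1))
        index += length
    return segments


# Hypothetical transcript excerpt.
messages = [
    "Hi, my name is Sam and I need help",
    "Solve 3 x + 12 = 24",
    "So x = 4",
    "Thanks, see you next week",
]
segments = segment_by_density(messages)
print(segments)
```

With segments labeled this way, false PII redactions could be counted separately for math-dense and math-sparse regions to check where they concentrate.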
Problem

Research questions and friction points this paper is trying to address.

de-identification
numeric ambiguity
PII detection
math tutoring
utility preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

numeric ambiguity
utility-preserving de-identification
MathEd-PII
math-aware prompting
domain-aware PII detection
Zhuqian Zhou
Cornell University
Kirk Vanacore
Cornell University
Bakhtawar Ahtisham
Cornell University
Jinsook Lee
Cornell University
Data Science in Education, Computational Social Science, AI Evaluation
Doug Pietrzak
Fresh Cognate
Daryl Hedley
Fresh Cognate
Jorge Dias
Fresh Cognate
Chris Shaw
UPchieve
Ruth Schäfer
Saga Education
René F. Kizilcec
Cornell University