AI Summary
Chain-of-thought (CoT) reasoning, while enhancing large language model performance, inadvertently leaks personally identifiable information (PII) from prompts, even when the model is explicitly instructed not to repeat such content. This work presents the first systematic characterization and tracking of PII leakage trajectories within CoT reasoning. We introduce a model-agnostic framework that quantifies risk-weighted PII leakage across 11 PII categories under varying reasoning budgets and evaluates multiple lightweight gating mechanisms. By integrating rule-based detectors, TF-IDF with logistic regression, GLiNER, and LLM-as-judge approaches within a hierarchical risk classification scheme, our experiments demonstrate that CoT consistently exacerbates high-risk PII exposure and that no single gating strategy is universally effective. These findings underscore the need for hybrid, adaptive gating mechanisms to mitigate privacy risks in CoT reasoning.
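To make the gating idea concrete, here is a minimal sketch of a TF-IDF + logistic regression gatekeeper of the kind mentioned above: a binary classifier that flags reasoning-trace sentences likely to contain PII so they can be blocked or redacted. The training sentences, labels, and threshold below are illustrative toy assumptions, not the paper's actual dataset or configuration.

```python
# Hypothetical sketch: sentence-level PII gatekeeper built from
# TF-IDF features and logistic regression (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data (1 = contains PII, 0 = clean); a real gatekeeper
# would be trained on a structured PII dataset.
texts = [
    "The patient John Doe lives at 42 Elm Street.",
    "Contact her at jane@example.com for details.",
    "The model then considers the general question.",
    "Summing the totals gives the final answer.",
]
labels = [1, 1, 0, 0]

gatekeeper = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
gatekeeper.fit(texts, labels)

def gate(sentence, threshold=0.5):
    """Return True if the sentence should be blocked/redacted."""
    return bool(gatekeeper.predict_proba([sentence])[0][1] >= threshold)
```

In practice such a classifier would be applied to each chunk of the reasoning trace before it is emitted, trading a small amount of latency for reduced leakage.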
Abstract
Chain-of-Thought (CoT) prompting improves LLM reasoning but can increase privacy risk by resurfacing personally identifiable information (PII) from the prompt into reasoning traces and outputs, even under policies that instruct the model not to restate PII. We study such direct, inference-time PII leakage using a model-agnostic framework that (i) defines leakage as risk-weighted, token-level events across 11 PII types, (ii) traces leakage curves as a function of the allowed CoT budget, and (iii) compares open- and closed-source model families on a structured PII dataset with a hierarchical risk taxonomy. We find that CoT consistently elevates leakage, especially for high-risk categories, and that leakage is strongly family- and budget-dependent. Increasing the reasoning budget can either amplify or attenuate leakage depending on the base model. We then benchmark lightweight inference-time gatekeepers: a rule-based detector, a TF-IDF + logistic regression classifier, a GLiNER-based NER model, and an LLM-as-judge, evaluating each with risk-weighted F1, Macro-F1, and recall. No single method dominates across models or budgets, motivating hybrid, style-adaptive gatekeeping policies that balance utility and risk under a common, reproducible protocol.
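The risk-weighted, token-level leakage events described in point (i) can be sketched as follows. The category names and weights here are illustrative placeholders, not the paper's actual 11-category taxonomy or weighting.

```python
# Hypothetical sketch of a risk-weighted leakage score: each detected
# PII event is a (category, number-of-leaked-tokens) pair, and the
# score sums token counts weighted by category risk.
RISK_WEIGHTS = {
    "ssn": 1.0, "credit_card": 1.0,              # high risk
    "phone": 0.6, "email": 0.6, "address": 0.6,  # medium risk
    "name": 0.3, "employer": 0.3,                # lower risk
}

def leakage_score(events):
    """events: list of (category, n_leaked_tokens) detected in a trace."""
    return sum(RISK_WEIGHTS.get(cat, 0.3) * n for cat, n in events)

# Comparing the same prompt answered with and without CoT makes the
# "CoT elevates leakage" finding measurable:
no_cot = [("name", 2)]
with_cot = [("name", 2), ("ssn", 1), ("address", 4)]
print(leakage_score(no_cot))    # 2 * 0.3 = 0.6
print(leakage_score(with_cot))  # 0.6 + 1.0 + 2.4 = 4.0
```

Plotting this score against the allowed CoT budget yields the leakage curves of point (ii).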