🤖 AI Summary
Existing text unlearning mechanisms in language models suffer from a fundamental flaw: rather than reliably erasing sensitive memories, they can inadvertently amplify membership inference and data reconstruction risks, inducing a “false unlearning” effect. Method: We propose the U-LiRA+ auditing framework and the TULA leakage attack (with both black-box and white-box variants), the first systematic approach to expose how text unlearning paradoxically increases training data exposure. Using multi-dimensional auditing, including likelihood ratio testing, membership inference, and reverse reconstruction, we empirically evaluate mainstream unlearning methods. Contribution/Results: Our experiments demonstrate that these methods significantly elevate attack success rates. The work challenges the prevailing “unlearning implies security” paradigm, establishing critical security boundaries and a rigorous evaluation benchmark for trustworthy machine unlearning.
📝 Abstract
Language Models (LMs) are prone to “memorizing” training data, including substantial sensitive user information. To mitigate privacy risks and safeguard the right to be forgotten, machine unlearning has emerged as a promising approach for enabling LMs to efficiently “forget” specific texts. However, despite these good intentions, is textual unlearning really as effective and reliable as expected? To address this concern, we first propose the Unlearning Likelihood Ratio Attack+ (U-LiRA+), a rigorous textual unlearning auditing method, and find that unlearned texts can still be detected with very high confidence after unlearning. Further, we conduct an in-depth investigation into the privacy risks of textual unlearning mechanisms in deployment and present the Textual Unlearning Leakage Attack (TULA), along with its variants in both black- and white-box scenarios. We show that textual unlearning mechanisms could instead reveal more about the unlearned texts, exposing them to significant membership inference and data reconstruction risks. Our findings highlight that existing textual unlearning actually gives a false sense of unlearning, underscoring the need for more robust and secure unlearning mechanisms.
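As background on the likelihood-ratio auditing idea the abstract references, here is a minimal, illustrative sketch of a LiRA-style membership score. It is not the paper's U-LiRA+ implementation; the function names, the Gaussian fit to shadow-model losses, and the toy numbers are all assumptions made for illustration:

```python
import math
import random

def gaussian_logpdf(x, mu, sigma):
    # Log-density of N(mu, sigma^2); sigma is assumed strictly positive.
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def lira_score(observed_loss, losses_in, losses_out):
    """Hypothetical LiRA-style membership score: the log-likelihood ratio of the
    target model's loss on a candidate text, under Gaussians fit to the losses of
    shadow models that did ('in') vs. did not ('out') train on that text.
    A positive score suggests the text was a training member."""
    def fit(xs):
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / max(len(xs) - 1, 1)
        return mu, math.sqrt(var) + 1e-8  # small floor avoids division by zero
    mu_in, sd_in = fit(losses_in)
    mu_out, sd_out = fit(losses_out)
    return (gaussian_logpdf(observed_loss, mu_in, sd_in)
            - gaussian_logpdf(observed_loss, mu_out, sd_out))

# Toy illustration: members typically incur lower loss than non-members.
random.seed(0)
losses_in = [random.gauss(1.0, 0.2) for _ in range(64)]   # shadow models trained on the sample
losses_out = [random.gauss(2.0, 0.2) for _ in range(64)]  # shadow models trained without it
print(lira_score(1.05, losses_in, losses_out) > 0)  # low observed loss -> looks like a member
print(lira_score(2.10, losses_in, losses_out) < 0)  # high observed loss -> looks like a non-member
```

An unlearning audit in this style would compare such scores before and after unlearning: if the score for an "unlearned" text remains confidently positive, the text is still detectable, which is the kind of failure the abstract reports.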