Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) to leaking identity attributes, generating harmful content, or producing hallucinations when processing semantically sensitive information (SemSI), a setting in which existing approaches struggle to balance privacy preservation against textual utility. The authors propose SemSIEdit, a framework that introduces an intelligent "editor" agent during inference to iteratively identify and rewrite sensitive segments, thereby reducing leakage risk while maintaining narrative coherence. The study reveals, for the first time, the privacy–utility Pareto frontier in SemSI handling, uncovering a model-size-dependent safety divergence and a "reasoning paradox": stronger reasoning capabilities simultaneously exacerbate risks and empower safer rewriting. Experiments demonstrate that SemSIEdit reduces information leakage by 34.6% on average across three SemSI tasks, with only a 9.8% loss in text utility, confirming that LLMs can enhance safety through constructive editing rather than refusal.

📝 Abstract
While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent sensitive information leaks without destroying utility remains an open scientific question. To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%. We also uncover a Scale-Dependent Safety Divergence: large reasoning models (e.g., GPT-5) achieve safety through constructive expansion (adding nuance), whereas capacity-constrained models revert to destructive truncation (deleting text). Finally, we identify a Reasoning Paradox: while inference-time reasoning increases baseline risk by enabling the model to make deeper sensitive inferences, it simultaneously empowers the defense to execute safe rewrites.
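The abstract describes an inference-time loop in which an "Editor" agent iteratively critiques a response, flags sensitive spans, and rewrites them in place rather than refusing outright. A minimal sketch of that critique-and-rewrite loop is below; the critic and rewriter here are rule-based stand-ins for the paper's LLM calls, and the function names, span representation, and stopping criterion are illustrative assumptions, not SemSIEdit's actual implementation.

```python
# Illustrative critique-and-rewrite loop in the spirit of agentic self-correction.
# NOTE: find_sensitive_spans / rewrite_span are rule-based placeholders for what
# would be LLM critic and editor calls in the paper's framework (an assumption).
import re
from typing import Callable, List, Tuple

def find_sensitive_spans(text: str, patterns: List[str]) -> List[Tuple[int, int]]:
    """Critic step: flag character spans matching sensitivity patterns."""
    spans = []
    for pat in patterns:
        for m in re.finditer(pat, text):
            spans.append((m.start(), m.end()))
    return sorted(spans)

def rewrite_span(text: str, span: Tuple[int, int],
                 generalize: Callable[[str], str]) -> str:
    """Editor step: replace one flagged span with a generalized rewrite,
    leaving the surrounding narrative intact (rewrite, not refusal)."""
    start, end = span
    return text[:start] + generalize(text[start:end]) + text[end:]

def semsi_edit(text: str, patterns: List[str],
               generalize: Callable[[str], str], max_rounds: int = 5) -> str:
    """Iterate critique -> rewrite until no spans are flagged or the
    round budget is exhausted."""
    for _ in range(max_rounds):
        spans = find_sensitive_spans(text, patterns)
        if not spans:
            break
        # Rewrite right-to-left so earlier span offsets stay valid.
        for span in reversed(spans):
            text = rewrite_span(text, span, generalize)
    return text

# Toy usage: generalize an exact age to reduce identity-attribute leakage.
edited = semsi_edit(
    "Alice is 34 years old and lives in Springfield.",
    [r"\b\d{2} years old\b"],
    lambda s: "an adult",
)
```

The key design point the sketch captures is the paper's contrast between constructive rewriting (the `generalize` callback adds or substitutes content) and destructive truncation (a callback returning an empty string would simply delete the span).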
Problem

Research questions and friction points this paper is trying to address.

Semantic Sensitive Information
Large Language Models
Privacy-Utility Tradeoff
Self-Correction
Inference-Time Safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Sensitive Information
Agentic Self-Correction
Inference-time Editing
Privacy-Utility Tradeoff
Reasoning Paradox
Umid Suleymanov
Department of Computer Science, Virginia Tech
Zaur Rajabov
School of IT and Engineering, ADA University
Emil Mirzazada
School of IT and Engineering, ADA University
Murat Kantarcioglu
Professor of Computer Science, Virginia Tech
Security and Privacy in AI, Databases, Data Science, Computer Security