Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) to leaking identity attributes, generating harmful content, or producing hallucinations when processing semantically sensitive information (SemSI), a setting in which existing approaches struggle to balance privacy preservation against textual utility. The authors propose SemSIEdit, a framework that introduces an intelligent "editor" agent during inference to iteratively identify and rewrite sensitive segments, thereby reducing leakage risk while maintaining narrative coherence. The study reveals, for the first time, the privacy–utility Pareto frontier in SemSI handling, uncovering a model-size-dependent safety divergence and a "reasoning paradox": stronger reasoning capabilities simultaneously exacerbate risks and empower safer rewriting. Experiments demonstrate that SemSIEdit reduces information leakage by 34.6% on average across three SemSI tasks, with only a 9.8% loss in text utility, confirming that LLMs can enhance safety through constructive editing rather than refusal.

📝 Abstract
While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent sensitive information leaks without destroying utility remains an open scientific question. To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%. We also uncover a Scale-Dependent Safety Divergence: large reasoning models (e.g., GPT-5) achieve safety through constructive expansion (adding nuance), whereas capacity-constrained models revert to destructive truncation (deleting text). Finally, we identify a Reasoning Paradox: while inference-time reasoning increases baseline risk by enabling the model to make deeper sensitive inferences, it simultaneously empowers the defense to execute safe rewrites.
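The abstract describes an inference-time loop in which an "Editor" agent iteratively critiques a response, flags sensitive spans, and rewrites them in place rather than refusing outright. A minimal sketch of that critique-and-rewrite loop is below; the critic and rewriter here are rule-based stand-ins for the paper's LLM calls, and the function names, span representation, and stopping criterion are illustrative assumptions, not SemSIEdit's actual implementation.

```python
# Illustrative critique-and-rewrite loop in the spirit of agentic self-correction.
# NOTE: find_sensitive_spans / rewrite_span are rule-based placeholders for what
# would be LLM critic and editor calls in the paper's framework (an assumption).
import re
from typing import Callable, List, Tuple

def find_sensitive_spans(text: str, patterns: List[str]) -> List[Tuple[int, int]]:
    """Critic step: flag character spans matching sensitivity patterns."""
    spans = []
    for pat in patterns:
        for m in re.finditer(pat, text):
            spans.append((m.start(), m.end()))
    return sorted(spans)

def rewrite_span(text: str, span: Tuple[int, int],
                 generalize: Callable[[str], str]) -> str:
    """Editor step: replace one flagged span with a generalized rewrite,
    leaving the surrounding narrative intact (rewrite, not refusal)."""
    start, end = span
    return text[:start] + generalize(text[start:end]) + text[end:]

def semsi_edit(text: str, patterns: List[str],
               generalize: Callable[[str], str], max_rounds: int = 5) -> str:
    """Iterate critique -> rewrite until no spans are flagged or the
    round budget is exhausted."""
    for _ in range(max_rounds):
        spans = find_sensitive_spans(text, patterns)
        if not spans:
            break
        # Rewrite right-to-left so earlier span offsets stay valid.
        for span in reversed(spans):
            text = rewrite_span(text, span, generalize)
    return text

# Toy usage: generalize an exact age to reduce identity-attribute leakage.
edited = semsi_edit(
    "Alice is 34 years old and lives in Springfield.",
    [r"\b\d{2} years old\b"],
    lambda s: "an adult",
)
```

The key design point the sketch captures is the paper's contrast between constructive rewriting (the `generalize` callback adds or substitutes content) and destructive truncation (a callback returning an empty string would simply delete the span).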
Problem

Research questions and friction points this paper is trying to address.

Semantic Sensitive Information
Large Language Models
Privacy-Utility Tradeoff
Self-Correction
Inference-Time Safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Sensitive Information
Agentic Self-Correction
Inference-time Editing
Privacy-Utility Tradeoff
Reasoning Paradox
Umid Suleymanov
Department of Computer Science, Virginia Tech
Zaur Rajabov
School of IT and Engineering, ADA University
Emil Mirzazada
School of IT and Engineering, ADA University
Murat Kantarcioglu
Professor of Computer Science, Virginia Tech
Security and Privacy in AI, Databases, Data Science, Computer Security