🤖 AI Summary
Large language models (LLMs) are vulnerable to privacy violation attacks (PVAs), while existing defenses suffer from exposure risks, high computational overhead, and insufficient robustness. To address these limitations, this paper proposes Retrieval-Confused Generation (RCG), a novel defense paradigm that jointly employs semantics-preserving review rewriting, database semantic perturbation, and a least-relevant retrieval strategy to inject controllable noise into model responses—thereby misleading attackers into extracting incorrect personal information while maintaining high stealth. Crucially, RCG avoids query rejection, preventing adaptive attack evolution. Extensive experiments across two real-world datasets and eight state-of-the-art LLMs demonstrate that RCG improves average defense success rate by 23.7% and reduces inference latency by 68%, significantly outperforming existing anonymization- and rejection-based approaches.
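The summary's first step is a semantics-preserving rewrite of the "user comments" before they enter the database. As a rough illustration only, the sketch below builds a hypothetical paraphrasing prompt of that kind; the paper's actual prompt wording is not given here, and `build_paraphrase_prompt` is an invented helper name.

```python
# Hypothetical sketch of a semantics-preserving paraphrasing prompt in the
# spirit of RCG's first step. The real prompt used by the paper is unknown;
# this only illustrates the idea of asking an LLM to keep meaning while
# stripping attribute-revealing cues.
def build_paraphrase_prompt(user_comment: str) -> str:
    """Wrap a user comment in an instruction asking an LLM to rewrite it,
    preserving meaning but generalizing personally revealing details."""
    return (
        "Rewrite the following comment so that its overall meaning is "
        "preserved, but remove or generalize any cues that could reveal "
        "the author's location, age, occupation, or other personal "
        "attributes.\n\nComment: " + user_comment
    )

prompt = build_paraphrase_prompt(
    "I grab a flat white near Flinders Street every morning."
)
print(prompt)
```

The rewritten comments would then populate the perturbed database that the retrieval step draws from.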
📝 Abstract
Recent advances in large language models (LLMs) have had a profound impact on society and have also raised new security concerns. In particular, owing to the remarkable inference ability of LLMs, the privacy violation attack (PVA), revealed by Staab et al., poses serious personal privacy risks. Existing defense methods mainly leverage LLMs to anonymize the input query, which incurs costly inference time and fails to achieve satisfactory defense performance. Moreover, while directly rejecting the PVA query may seem effective, doing so exposes the defense itself and thereby promotes the evolution of PVAs. In this paper, we propose a novel defense paradigm based on retrieval-confused generation (RCG) of LLMs, which can defend against PVAs efficiently and covertly. We first design a paraphrasing prompt that induces the LLM to rewrite the "user comments" of the attack query, constructing a disturbed database. We then propose a least-relevant retrieval strategy to retrieve the desired user data from the disturbed database. Finally, the "data comments" are replaced with the retrieved user data to form a defended query, causing the LLM to respond to the adversary with incorrect personal attributes, i.e., the attack fails. Extensive experiments on two datasets and eight popular LLMs comprehensively evaluate the feasibility and superiority of the proposed defense method.
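The core retrieval trick in the abstract is to fetch the *least* similar record instead of the most similar one, so the substituted "data comments" mislead attribute inference. A minimal sketch, assuming a toy bag-of-words cosine similarity (the paper's actual embedding model and database schema are not specified, and `least_relevant` is an illustrative name):

```python
# Sketch of the "least-relevant retrieval" step: given the attack query's
# comment, return the database entry LEAST similar to it. Similarity here
# is a toy bag-of-words cosine; a real system would use learned embeddings.
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Lowercased bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def least_relevant(query_comment: str, database: list[str]) -> str:
    """Return the database entry with the LOWEST similarity to the query."""
    q = bow_vector(query_comment)
    return min(database, key=lambda doc: cosine(q, bow_vector(doc)))

db = [
    "I commute daily by tram in Zurich and love alpine hiking.",
    "Retired teacher, spend afternoons gardening in my backyard.",
    "College student juggling part-time barista shifts downtown.",
]
query = "I take the tram every morning and hike in the Alps on weekends."
print(least_relevant(query, db))  # picks the entry sharing no content words
```

Substituting this least-relevant record into the defended query is what steers the LLM toward wrong personal attributes without ever refusing the request.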