🤖 AI Summary
This work challenges the conventional assumption that the internal states (ISs) of large language models (LLMs) are irreversible to their inputs, exposing critical privacy-leakage risks. The authors propose four internal-state inversion attacks that combine two-phase optimization-based inversion, cross-model transferability, and a generation-based translation model to reconstruct inputs from internal representations under both white-box and black-box weight access. By escaping the local optima that limited prior work, the attacks achieve high semantic fidelity and token-level accuracy, reaching an 86.88 F1 token-matching score on a 4,112-token medical input and demonstrating that deep-layer states offer no inherent privacy guarantee. The attacks generalize across six mainstream LLMs, and the evaluated defenses fail to provide comprehensive protection. This work delivers foundational insights into collaborative-inference security and establishes a rigorous technical benchmark for model auditing and privacy assessment.
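To make the core idea of optimization-based inversion concrete, here is a minimal toy sketch (my illustration, not the paper's method): given an observed internal state of a tiny one-layer "model", recover the input by gradient descent on the state-matching loss. The paper's attacks additionally use a two-phase process to avoid local minima and operate on real LLM layers; this sketch uses plain gradient descent on an invented `tanh` layer purely to show the principle.

```python
import numpy as np

# Toy illustration of optimization-based internal-state inversion.
# Everything here (the layer, dimensions, learning rate) is a hypothetical
# stand-in, not the paper's actual setup.
rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
W = rng.normal(size=(d_hid, d_in)) / np.sqrt(d_in)  # fixed "layer" weights

x_true = rng.normal(size=d_in)       # the private input
h_target = np.tanh(W @ x_true)       # internal state exposed to the attacker

x = np.zeros(d_in)                   # start from an uninformative guess
lr = 0.2
for _ in range(50_000):
    h = np.tanh(W @ x)
    # gradient of 0.5 * ||h - h_target||^2 with respect to x
    grad = W.T @ ((h - h_target) * (1.0 - h ** 2))
    x -= lr * grad

print(float(np.max(np.abs(x - x_true))))  # reconstruction error
```

The same recipe scales up conceptually: replace the toy layer with the first k transformer layers of an LLM and optimize over input embeddings, which is where local minima become the main obstacle.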
📝 Abstract
Large Language Models (LLMs) are increasingly integrated into daily routines, yet they raise significant privacy and safety concerns. Recent research proposes collaborative inference, which outsources the early-layer inference to ensure data locality, and introduces model safety auditing based on inner neuron patterns. Both techniques expose the LLM's Internal States (ISs), which are traditionally considered irreversible to inputs due to optimization challenges and the highly abstract representations in deep layers. In this work, we challenge this assumption by proposing four inversion attacks that significantly improve the semantic similarity and token matching rate of inverted inputs. Specifically, we first develop two white-box optimization-based attacks tailored for low-depth and high-depth ISs. These attacks avoid convergence to local minima, a limitation observed in prior work, through a two-phase inversion process. Then, we extend our optimization attack to the more practical black-box weight-access setting by leveraging the transferability between the source and the derived LLMs. Additionally, we introduce a generation-based attack that treats inversion as a translation task, employing an inversion model to reconstruct inputs. Extensive evaluation on short and long prompts from medical-consulting and coding-assistance datasets, across six LLMs, validates the effectiveness of our inversion attacks. Notably, a 4,112-token medical-consulting prompt can be nearly perfectly inverted, with 86.88 F1 token matching, from the middle layer of the Llama-3 model. Finally, we evaluate four practical defenses, find that none of them fully prevents IS inversion, and draw conclusions for future mitigation design.
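The generation-based attack described above treats inversion as a learned translation from internal states back to tokens. The following toy sketch (my illustration, not the paper's inversion model, which is generative) shows the supervised flavor of that idea: an attacker queries a tiny invented "encoder" to build (state, token) pairs offline, then fits an inverter that maps unseen states back to tokens. All names and dimensions here are hypothetical.

```python
import numpy as np

# Toy sketch of learning an inversion model from (internal state, token)
# pairs. The "encoder" below is an invented stand-in for an LLM layer.
rng = np.random.default_rng(1)
V, d = 20, 32                            # hypothetical vocab size / state dim
E = rng.normal(size=(V, d))              # token embeddings
mix = rng.normal(size=(d, d)) / np.sqrt(d)

def internal_state(tok):
    # Deterministic per-token internal state the attacker can observe.
    return np.tanh(mix @ E[tok])

# Collect training pairs by querying the model on attacker-chosen tokens.
tokens = rng.integers(0, V, size=2000)
H = np.stack([internal_state(t) for t in tokens])
Y = np.eye(V)[tokens]                    # one-hot targets

# Fit a linear inverter by least squares: states -> token scores.
Minv, *_ = np.linalg.lstsq(H, Y, rcond=None)

# Invert unseen states back to tokens and measure accuracy.
test = rng.integers(0, V, size=200)
pred = np.array([np.argmax(internal_state(t) @ Minv) for t in test])
print(float((pred == test).mean()))
```

In the paper's setting the inverter is a sequence model decoding full prompts rather than a per-token linear map, but the training signal is the same: pairs of internal states and the inputs that produced them.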