🤖 AI Summary
This work investigates whether large language models (LLMs) implicitly leak confidential information during text generation, even when that information is never explicitly referenced. The authors use a two-model setup: one model writes a story while instructed not to reveal a designated secret word, and a second model tries to identify the secret from the story in a binary discrimination test. Although the secret word never appears literally, all five frontier LLMs tested leak it at rates significantly different from chance, with detection accuracy reaching up to 79%, and deliberate concealment introduces avoidance patterns that are themselves detectable. The leakage is readable across models and scales sharply with model size within two model families, but it disappears for short-form texts such as jokes. Supplying a decoy concept partially redirects the leakage from the real secret to the decoy. The study argues that attending to a secret opens a persistent, hard-to-suppress channel for implicit information leakage in LLMs.
📝 Abstract
Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically -- through topic choice, imagery, and setting -- at rates significantly different from chance, up to 79\%. When told to actively hide the secret, models write \emph{away from} it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two model families, and disappears entirely for short-form writing like jokes. Giving the model a decoy concept to ``focus on instead'' partially redirects the leakage from the real secret to the decoy. Attending to a secret appears to open up an information channel that frontier LLMs cannot close, even when instructed to.
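The binary discrimination test described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' harness: `write_story` and `guess_secret` are hypothetical callables standing in for the writer and reader models, the distractor sampling and the exact two-sided binomial test are our own choices, and the toy writer and reader at the bottom are deterministic stand-ins rather than real LLMs.

```python
import random
from math import comb
from typing import Callable, Sequence, Tuple


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total mass of outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9)))


def run_discrimination_test(
    secrets: Sequence[str],
    write_story: Callable[[str], str],            # hypothetical writer model
    guess_secret: Callable[[str, str, str], str],  # hypothetical reader model
    rng: random.Random,
) -> Tuple[int, int]:
    """Return (correct, trials) for a two-alternative secret-identification test."""
    correct = trials = 0
    for secret in secrets:
        story = write_story(secret)
        # Assumption: stories containing the secret verbatim are excluded,
        # so that only thematic leakage is measured.
        if secret.lower() in story.lower():
            continue
        # Pair the true secret with one distractor and randomize the order.
        distractor = rng.choice([w for w in secrets if w != secret])
        candidates = [secret, distractor]
        rng.shuffle(candidates)
        if guess_secret(story, candidates[0], candidates[1]) == secret:
            correct += 1
        trials += 1
    return correct, trials


# Toy stand-ins (NOT real models): the writer leaks a theme word tied to the
# secret, and the reader picks the candidate whose theme word appears.
THEMES = {"ocean": "tide", "forest": "moss", "desert": "dune", "castle": "tower"}


def toy_writer(secret: str) -> str:
    return f"A quiet story in which the {THEMES[secret]} is mentioned."


def toy_reader(story: str, a: str, b: str) -> str:
    return a if THEMES[a] in story else b


if __name__ == "__main__":
    correct, trials = run_discrimination_test(
        list(THEMES), toy_writer, toy_reader, random.Random(0)
    )
    print(f"accuracy = {correct}/{trials}, p = {binomial_two_sided_p(correct, trials):.3f}")
```

With two candidates per trial, chance accuracy is 50%. A two-sided test fits the abstract's wording ("significantly different from chance"), since avoidance under hide instructions could in principle push accuracy below chance as well as above it.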