Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates whether large language models (LLMs) implicitly leak confidential information in the text they generate, even when they never state it explicitly. The setup uses two models: one writes a story while instructed not to reveal a designated secret word, and a second tries to identify the secret from the story in a binary discrimination test. All five frontier LLMs tested leak the secret at rates significantly different from chance, with detection accuracy reaching up to 79%, and instructing a model to actively hide the secret produces avoidance patterns that are themselves detectable. The leakage is readable across models and scales sharply with model size within two model families, but it disappears for short-form writing such as jokes. The study concludes that attending to a secret opens a thematic information channel that frontier LLMs cannot close, even when instructed to.

📝 Abstract

Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically -- through topic choice, imagery, and setting -- at rates significantly different from chance, up to 79%. When told to actively hide the secret, models write away from it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two model families, and disappears entirely for short-form writing like jokes. Giving the model a decoy concept to "focus on instead" partially redirects the leakage from the real secret to the decoy. Attending to a secret appears to open up an information channel that frontier LLMs cannot close, even when instructed to.
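The binary discrimination test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' protocol: the abstract does not specify the test's details, so the sketch assumes a two-alternative forced choice in which the detector sees a story together with the true secret and one distractor word, giving a chance rate of 50%. The names `run_discrimination`, `binomial_two_sided_p`, `toy_guesser`, and `ASSOCIATIONS` are hypothetical, and the keyword-overlap guesser only stands in for the second language model.

```python
import random
from math import comb
from typing import Callable, Sequence, Tuple

Trial = Tuple[str, str, str]              # (story, secret, distractor)
Guesser = Callable[[str, str, str], str]  # (story, option_a, option_b) -> chosen word


def run_discrimination(trials: Sequence[Trial], guesser: Guesser, seed: int = 0) -> int:
    """Count the trials in which the guesser picks the true secret."""
    rng = random.Random(seed)
    correct = 0
    for story, secret, distractor in trials:
        options = [secret, distractor]
        rng.shuffle(options)  # randomize order so position carries no signal
        if guesser(story, options[0], options[1]) == secret:
            correct += 1
    return correct


def binomial_two_sided_p(correct: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value against chance rate p.

    Sums the probability of every outcome that is no more likely than
    the observed count, so accuracy far below chance (as might happen
    with detectable avoidance) also registers as significant.
    """
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[correct]
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


# Hypothetical stand-in for the detector model: scores each candidate
# word by how many of its associated words appear in the story.
ASSOCIATIONS = {
    "ocean": {"tide", "salt", "waves"},
    "forest": {"moss", "pine", "canopy"},
}


def toy_guesser(story: str, a: str, b: str) -> str:
    words = set(story.lower().split())

    def score(candidate: str) -> int:
        return len(words & ASSOCIATIONS.get(candidate, set()))

    return a if score(a) >= score(b) else b


if __name__ == "__main__":
    # Illustrative stories only; neither contains its secret word literally.
    trials = [
        ("The tide pulled salt across the sand", "ocean", "forest"),
        ("Moss crept under the pine canopy", "forest", "ocean"),
    ]
    n_correct = run_discrimination(trials, toy_guesser)
    print(n_correct, len(trials), binomial_two_sided_p(n_correct, len(trials)))
```

A two-sided test is used here because the abstract reports rates "significantly different from chance" rather than only above it. In a real run, `toy_guesser` would be replaced by a call to the second model, and the number of trials would need to be large enough for the p-value to be informative.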

Problem

Research questions and friction points this paper is trying to address.

information leakage

language models

compartmentalization

secret keeping

involuntary disclosure

Innovation

Methods, ideas, or system contributions that make the work stand out.

involuntary information leakage

language model secrecy

thematic leakage