Eliciting Secret Knowledge from Language Models

📅 2025-10-01

📈 Citations: 0

✨ Influential: 0

career value

182K/year

🤖 AI Summary

This study addresses the extraction and auditing of “secret knowledge”—latent, unexpressed knowledge possessed by language models but not readily accessible through standard prompting. We introduce the first publicly available benchmark for secret knowledge extraction, systematically comparing black-box methods (e.g., prefill attacks) against white-box approaches (e.g., logit lens, sparse autoencoders). Empirical evaluation reveals black-box techniques generally outperform white-box ones, though the latter exhibit advantages in specific configurations. To enable controlled analysis, we propose training strategies that explicitly induce knowledge hiding or exposure, and design a reproducible probing framework. Experiments demonstrate that our method significantly surpasses baselines on two-thirds of evaluated tasks, achieving the first efficient and verifiable inference of private knowledge in large language models. All models, datasets, and code are fully open-sourced to support reproducibility and further research.

Technology Category

Application Category

📝 Abstract

We study secret elicitation: discovering knowledge that an AI possesses but does not explicitly verbalize. As a testbed, we train three families of large language models (LLMs) to possess specific knowledge that they apply downstream but deny knowing when asked directly. For example, in one setting, we train an LLM to generate replies that are consistent with knowing the user is female, while denying this knowledge when asked directly. We then design various black-box and white-box secret elicitation techniques and evaluate them based on whether they can help an LLM auditor successfully guess the secret knowledge. Many of our techniques improve on simple baselines. Our most effective techniques (performing best in 2/3 settings) are based on prefill attacks, a black-box technique where the LLM reveals secret knowledge when generating a completion from a predefined prefix. In our remaining setting, white-box techniques based on logit lens and sparse autoencoders (SAEs) are most effective. We release our models and code, establishing a public benchmark for evaluating secret elicitation methods.

Problem

Research questions and friction points this paper is trying to address.

Discovering hidden knowledge that AI models possess but do not explicitly state

Developing techniques to extract unverbalized information from large language models

Evaluating black-box and white-box methods for secret knowledge elicitation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Prefill attacks reveal secrets via black-box completion

Logit lens techniques enable white-box secret extraction

Sparse autoencoders decode hidden knowledge from activations

🔎 Similar Papers

No similar papers found.