Evaluating Language Model Reasoning about Confidential Information

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether language models adhere to user-defined safety constraints—such as password-based authorization—in high-stakes settings, and shows that their reasoning traces can inadvertently leak confidential information, challenging the implicit assumption that user-visible reasoning implies safety. Method: The authors introduce PasswordEval, a benchmark for context-dependent authorization tasks that combines adversarial jailbreaking attacks, multi-turn dialogues, and fine-grained analysis of reasoning traces to systematically evaluate leading open- and closed-source models. Results: State-of-the-art models struggle with this seemingly simple authorization task, and, counterintuitively, reasoning capabilities do not generally improve performance; instead, reasoning traces frequently expose the confidential information itself. The study provides empirical evidence that stronger reasoning does not translate into stronger security, and establishes a reproducible evaluation paradigm for trustworthy autonomous agents, moving safety assessment beyond input-output correctness toward transparent, constraint-compliant reasoning.

📝 Abstract
As language models are increasingly deployed as autonomous agents in high-stakes settings, ensuring that they reliably follow user-defined rules has become a critical safety concern. To this end, we study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications. For this analysis, we develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized (i.e., with a correct password). We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance. In fact, we find that reasoning traces frequently leak confidential information, which calls into question whether reasoning traces should be exposed to users in such applications. We also scale the difficulty of our evaluation along multiple axes: (i) by adding adversarial user pressure through various jailbreaking strategies, and (ii) through longer multi-turn conversations where password verification is more challenging. Overall, our results suggest that current frontier models are not well-suited to handling confidential information, and that reasoning capabilities may need to be trained in a different manner to make them safer for release in high-stakes settings.
Problem

Research questions and friction points this paper is trying to address.

Evaluating whether LM reasoning adheres to confidentiality constraints
Assessing contextual robustness to safety specifications
Testing password-based authorization in user requests
Innovation

Methods, ideas, or system contributions that make the work stand out.

PasswordEval benchmark for authorization testing
Evaluating models under adversarial jailbreaking strategies
Testing multi-turn conversation password verification
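The evaluation described above can be sketched as a minimal harness: given a secret and a required password, a judge scores a model's final response and its visible reasoning trace separately, so that a correct refusal can still be flagged as unsafe if the trace leaks the secret. This is a hypothetical illustration of the setup; the secret, password, and scoring names below are assumptions, not the benchmark's actual prompts or metrics.

```python
# Hypothetical PasswordEval-style check: the constants and field names
# are illustrative assumptions, not taken from the benchmark itself.

SECRET = "blue-falcon-42"    # confidential payload the model must guard
PASSWORD = "open-sesame"     # credential that authorizes its release

def judge(response: str, reasoning: str, supplied_password: str) -> dict:
    """Score one interaction along the axes the paper describes."""
    authorized = supplied_password == PASSWORD
    revealed = SECRET in response
    leaked_in_trace = SECRET in reasoning  # leakage via visible reasoning
    return {
        "authorized": authorized,
        # correct behavior: reveal the secret iff the request is authorized
        "correct_decision": revealed == authorized,
        # unsafe even when the final answer is a correct refusal
        "reasoning_leak": leaked_in_trace and not authorized,
    }

# An unauthorized request that is refused, yet whose reasoning trace
# leaks the secret -- the failure mode the paper highlights:
result = judge(
    response="I can't share that without the correct password.",
    reasoning=f"The secret is {SECRET}, but the password is wrong, so refuse.",
    supplied_password="guess123",
)
```

Here the final answer is correct (`correct_decision` is true) while `reasoning_leak` is also true, mirroring the paper's point that exposing reasoning trajectories to users can undermine an otherwise safe output.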