All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language

📅 2025-10-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work tests whether large language models (LLMs) can reason in ciphered text, a capability that attackers or misaligned models could use to evade chain-of-thought (CoT) monitoring. Method: For each of 28 ciphers, we fine-tune and prompt up to 10 models to reason in that cipher, using accuracy on math problems as a proxy for reasoning ability and translation of ciphertext to English as a check on comprehension. Contribution/Results: We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data: models reason accurately in well-known ciphers such as rot13, but even frontier models struggle with lesser-known ones. Scaling laws show that the capability improves only slowly with additional fine-tuning data. We also observe an asymmetry: models can translate ciphertext accurately to English, yet their accuracy can drop significantly when they reason in it. These findings suggest that evading CoT monitoring through ciphered reasoning may be an ineffective tactic for current models, and they offer guidance on constraining the development of this capability in future frontier models.

📝 Abstract

Detecting harmful AI actions is important as AI agents gain adoption. Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. However, attackers and misaligned models might evade CoT monitoring through ciphered reasoning: reasoning hidden in encrypted, translated, or compressed text. To assess this risk, we test whether models can perform ciphered reasoning. For each of 28 different ciphers, we fine-tune and prompt up to 10 models to reason in that cipher. We measure model accuracy on math problems as a proxy for reasoning ability. Across the models we test, we find an asymmetry: model accuracy can drop significantly when reasoning in ciphered text, even though models demonstrate comprehension of ciphered text by being able to translate it accurately to English. Even frontier models struggle with lesser-known ciphers, although they can reason accurately in well-known ciphers like rot13. We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data. We also identify scaling laws showing that ciphered reasoning capability improves slowly with additional fine-tuning data. Our work suggests that evading CoT monitoring using ciphered reasoning may be an ineffective tactic for current models and offers guidance on constraining the development of this capability in future frontier models.
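To make the idea of ciphered reasoning concrete, here is a minimal sketch of rot13, the well-known cipher named in the abstract, applied to a short reasoning trace. The example sentence and the helper name `rot13` are illustrative assumptions, not taken from the paper; the sketch only shows what a ciphered trace looks like and that decoding it is mechanical.

```python
import codecs


def rot13(text: str) -> str:
    """Rotate each ASCII letter by 13 places; digits, spaces, and punctuation are unchanged."""
    return codecs.encode(text, "rot_13")


# Hypothetical reasoning trace, invented for illustration only.
plain_reasoning = "First add 2 and 3 to get 5, then double it to get 10."

# A model reasoning "in cipher" would emit text like this instead of plain English.
ciphered = rot13(plain_reasoning)

# rot13 is its own inverse, so applying it twice recovers the original text.
recovered = rot13(ciphered)

print(ciphered)
print(recovered == plain_reasoning)
```

Because rot13 is trivially reversible, a monitor could decode such a trace easily; the abstract's concern is with whether models can reason while writing in ciphers at all, particularly lesser-known ones.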

Problem

Research questions and friction points this paper is trying to address.

Assessing AI models' reasoning ability in encrypted or ciphered languages

Testing whether models can evade chain-of-thought monitoring through hidden reasoning

Evaluating performance drop in ciphered reasoning despite text comprehension

Innovation

Methods, ideas, or system contributions that make the work stand out.

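Fine-tuned and prompted up to 10 models to reason in each of 28 ciphers

Measured accuracy on math problems as a proxy for reasoning ability

Identified scaling laws showing that ciphered reasoning capability improves slowly with additional fine-tuning data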
