CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges

📅 2025-04-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a fundamental deficiency of large language models (LLMs): reasoning over cryptographic decryption. To this end, we introduce CipherBank—the first comprehensive benchmark tailored for cryptographic reasoning—comprising 2,358 privacy-sensitive, real-world-inspired problems spanning five domains, fourteen subdomains, and nine encryption algorithms (including classical and custom ciphers). We propose a multidimensional cryptographic problem design framework and establish a principled, hierarchical evaluation methodology grounded in cryptographic theory. High-quality ciphertext–plaintext pairs are generated via human-crafted construction augmented by rule-based validation, and failure mechanisms are diagnosed through error-pattern analysis and causal attribution tracing. Experimental results reveal that state-of-the-art reasoning models—including o1 and DeepSeek-R1—achieve less than 32% accuracy on classical cipher decryption, far below their performance on mathematical and programming tasks. This gap exposes intrinsic limitations in symbolic manipulation, inverse modeling, and joint semantic–structural reasoning.
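The rule-based validation mentioned above can be pictured as a round-trip check: encrypt the plaintext, then confirm that decryption recovers it exactly. The sketch below uses a Caesar shift for illustration; the function names and the choice of cipher are assumptions, not the paper's actual validation pipeline.

```python
# Illustrative round-trip validation of a ciphertext-plaintext pair.
# The Caesar shift stands in for CipherBank's real cipher suite.

def caesar_encrypt(plaintext: str, shift: int) -> str:
    """Shift alphabetic characters by `shift`; leave others untouched."""
    out = []
    for ch in plaintext:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

def caesar_decrypt(ciphertext: str, shift: int) -> str:
    # Decryption is encryption with the negated shift.
    return caesar_encrypt(ciphertext, -shift)

def validate_pair(plaintext: str, shift: int) -> bool:
    """Rule-based check: the pair is valid only if decryption
    recovers the plaintext exactly."""
    ciphertext = caesar_encrypt(plaintext, shift)
    return caesar_decrypt(ciphertext, shift) == plaintext

print(validate_pair("Meet at the bank at 9 PM.", 3))  # True
```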

📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities, especially the recent advancements in reasoning, such as o1 and o3, pushing the boundaries of AI. Despite these impressive achievements in mathematics and coding, the reasoning abilities of LLMs in domains requiring cryptographic expertise remain underexplored. In this paper, we introduce CipherBank, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs in cryptographic decryption tasks. CipherBank comprises 2,358 meticulously crafted problems, covering 262 unique plaintexts across 5 domains and 14 subdomains, with a focus on privacy-sensitive and real-world scenarios that necessitate encryption. From a cryptographic perspective, CipherBank incorporates 3 major categories of encryption methods, spanning 9 distinct algorithms, ranging from classical ciphers to custom cryptographic techniques. We evaluate state-of-the-art LLMs on CipherBank, e.g., GPT-4o, DeepSeek-V3, and cutting-edge reasoning-focused models such as o1 and DeepSeek-R1. Our results reveal significant gaps in reasoning abilities not only between general-purpose chat LLMs and reasoning-focused LLMs but also in the performance of current reasoning-focused models when applied to classical cryptographic decryption tasks, highlighting the challenges these models face in understanding and manipulating encrypted data. Through detailed analysis and error investigations, we provide several key observations that shed light on the limitations and potential improvement areas for LLMs in cryptographic reasoning. These findings underscore the need for continuous advancements in LLM reasoning capabilities.
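To make the task concrete, here is a minimal sketch of one classical algorithm in the range the abstract describes, a Vigenère cipher; a CipherBank-style item would present the ciphertext and ask the model to recover the plaintext. The implementation and the sample sentence are illustrative assumptions, not material from the benchmark itself.

```python
# Illustrative Vigenere cipher, one of the classical algorithm
# families such a benchmark covers. Not reproduced from CipherBank.
from itertools import cycle

def vigenere(text: str, key: str, decrypt: bool = False) -> str:
    """Shift each letter by the next key letter; non-letters pass
    through unchanged and do not advance the key."""
    key_iter = cycle(key.lower())
    out = []
    for ch in text:
        if ch.isalpha():
            k = ord(next(key_iter)) - ord('a')
            if decrypt:
                k = -k
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + k) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

ct = vigenere("transfer the funds tonight", "key")
print(ct)                                  # ciphertext shown to the model
print(vigenere(ct, "key", decrypt=True))   # recovers the plaintext
```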
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM reasoning in cryptographic decryption tasks
Assessing gaps between general and reasoning-focused LLMs
Exploring LLM limitations in understanding encrypted data
Innovation

Methods, ideas, or system contributions that make the work stand out.

CipherBank benchmark for LLM cryptographic reasoning
Includes 2,358 problems spanning 9 encryption algorithms
Evaluates GPT-4o, DeepSeek-V3, and reasoning-focused models