AICrypto: A Comprehensive Benchmark For Evaluating Cryptography Capabilities of Large Language Models

📅 2025-07-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Systematic evaluation of large language models' (LLMs) capabilities in cryptography remains severely underexplored, with no comprehensive benchmark covering knowledge acquisition, reasoning, and practical application. Method: We introduce AICrypto, the first multi-level cryptography-specific benchmark, comprising three task categories: multiple-choice questions (knowledge recall), CTF-style challenges (vulnerability exploitation), and formal proofs (abstract mathematical reasoning). We incorporate human expert baselines and design an agent-based automated evaluation framework to ensure accuracy and scalability. Contribution/Results: Empirical evaluation of 17 mainstream LLMs reveals that top-performing models approach or surpass human experts on foundational concept recall and routine attack tasks, yet exhibit substantial limitations in dynamic analysis and higher-order mathematical reasoning. This work establishes a standardized, extensible evaluation infrastructure and delivers critical insights into the current state and bottlenecks of AI capabilities in cryptography.
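The agent-based evaluation loop for CTF challenges can be sketched roughly as below. All names here (`toy_challenge`, `scripted_model`, `evaluate`) are hypothetical stand-ins for illustration only; the paper's actual framework is only summarized on this page.

```python
# Minimal sketch of an agent-based CTF evaluation loop (hypothetical design,
# not the paper's implementation): the agent proposes an action, the
# challenge environment returns output, and the loop stops when a flag
# pattern appears or the step budget is exhausted.
import re

FLAG = "flag{demo}"  # the toy challenge's hidden flag

def toy_challenge(cmd: str) -> str:
    """Stand-in environment: exactly one command reveals the flag."""
    if cmd == "cat flag.txt":
        return FLAG
    return "command not found"

def scripted_model(history):
    """Placeholder for an LLM call; replays a fixed action sequence."""
    actions = ["ls", "cat flag.txt"]
    return actions[min(len(history), len(actions) - 1)]

def evaluate(model, challenge, max_steps=5):
    """Query the model, execute its action, and check for the flag."""
    history = []
    for _ in range(max_steps):
        action = model(history)
        output = challenge(action)
        history.append((action, output))
        if re.search(r"flag\{.*\}", output):  # solved when a flag appears
            return True, len(history)
    return False, len(history)

solved, steps = evaluate(scripted_model, toy_challenge)
```

In a real harness, `toy_challenge` would be a sandboxed shell or network service and `scripted_model` an actual LLM API call; the scoring logic (flag match within a step budget) stays the same.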

📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across a variety of domains. However, their applications in cryptography, which serves as a foundational pillar of cybersecurity, remain largely unexplored. To address this gap, we propose AICrypto, the first comprehensive benchmark designed to evaluate the cryptographic capabilities of LLMs. The benchmark comprises 135 multiple-choice questions, 150 capture-the-flag (CTF) challenges, and 18 proof problems, covering a broad range of skills from factual memorization to vulnerability exploitation and formal reasoning. All tasks are carefully reviewed or constructed by cryptography experts to ensure correctness and rigor. To support automated evaluation of CTF challenges, we design an agent-based framework. To gain deeper insight into the current state of cryptographic proficiency in LLMs, we introduce human expert performance baselines for comparison across all task types. Our evaluation of 17 leading LLMs reveals that state-of-the-art models match or even surpass human experts in memorizing cryptographic concepts, exploiting common vulnerabilities, and routine proofs. However, they still lack a deep understanding of abstract mathematical concepts and struggle with tasks that require multi-step reasoning and dynamic analysis. We hope this work could provide insights for future research on LLMs in cryptographic applications. Our code and dataset are available at https://aicryptobench.github.io.
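For a sense of the "common vulnerabilities" that CTF-style tasks typically target, a classic routine attack is textbook RSA with public exponent 3 and a short unpadded message: since the cube of the message never wraps around the modulus, the plaintext falls out of an integer cube root. This is a generic textbook example, not a challenge taken from the benchmark.

```python
# Textbook RSA low-exponent attack (generic example, not from AICrypto):
# with e = 3, no padding, and m**3 < n, the "ciphertext" is exactly m**3,
# so an integer cube root recovers the plaintext.

def icbrt(x: int) -> int:
    """Integer cube root via binary search (largest r with r**3 <= x)."""
    lo, hi = 0, 1 << ((x.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Toy public key: n is much larger than the message cubed, e = 3.
p = (1 << 607) - 1          # Mersenne prime 2^607 - 1
q = (1 << 521) - 1          # Mersenne prime 2^521 - 1
n, e = p * q, 3

m = int.from_bytes(b"flag{cube_root}", "big")
c = pow(m, e, n)            # m**3 < n, so no modular reduction occurs

m_rec = icbrt(c)
recovered = m_rec.to_bytes((m_rec.bit_length() + 7) // 8, "big")
```

The abstract's finding that top models handle such routine attacks well is plausible precisely because exploits like this follow a well-documented recipe; the harder benchmark tasks (dynamic analysis, multi-step proofs) do not.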
Problem

Research questions and friction points this paper is trying to address.

Evaluating cryptographic capabilities of large language models
Assessing LLMs in memorization, vulnerability exploitation, and proofs
Identifying gaps in abstract math understanding and multi-step reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark for LLM cryptography evaluation
Agent-based framework for automated CTF challenge assessment
Human expert baselines for performance comparison
👥 Authors
Yu Wang
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Shanghai Qi Zhi Institute
Yijian Liu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Liheng Ji
Shanghai Qi Zhi Institute; Institute for Interdisciplinary Information Sciences, Tsinghua University
Han Luo
Institute for Interdisciplinary Information Sciences, Tsinghua University
Wenjie Li
Institute for Interdisciplinary Information Sciences, Tsinghua University; Shanghai Qi Zhi Institute
Xiaofei Zhou
Shanghai Jiao Tong University
Chiyun Feng
School of Cyber Security, University of Chinese Academy of Sciences
Puji Wang
School of Cyber Security, University of Chinese Academy of Sciences
Yuhan Cao
Shanghai Qi Zhi Institute
Geyuan Zhang
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Xiaojian Li
College of AI, Tsinghua University; Shanghai Qi Zhi Institute
Rongwu Xu
University of Washington
Yilei Chen
Institute for Interdisciplinary Information Sciences, Tsinghua University; Shanghai Qi Zhi Institute
Tianxing He
Tsinghua University