CryptoX : Compositional Reasoning Evaluation of Large Language Models

📅 2025-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reasoning benchmarks inadequately quantify large language models' (LLMs) compositional reasoning capability: the ability to decompose a problem into multi-step, cross-task subproblems, perform coordinated inference, and integrate the resulting conclusions. Method: We propose CryptoX, the first evaluation framework to integrate cryptographic principles (e.g., RSA factorization, zero-knowledge proof simulation) with multi-source reasoning benchmarks, establishing a compositional reasoning assessment paradigm; building on CryptoX, CryptoBench applies these principles across several benchmarks. It employs multi-benchmark fusion evaluation and mechanistic interpretability analysis, including neuron activation tracing and modular attribution, to dissect LLM behavior across subproblem decoupling, logical chain construction, and result integration. Contribution/Results: Experiments on leading open-source and closed-source models reveal a substantial capability gap in compositional reasoning, which correlates strongly with intelligence emergence. CryptoBench provides an interpretable diagnostic tool and actionable optimization pathways for advancing LLM reasoning capabilities.

📝 Abstract
Compositional reasoning capacity has long been regarded as critical to the generalization and intelligence emergence of large language models (LLMs). However, despite numerous reasoning-related benchmarks, the compositional reasoning capacity of LLMs is rarely studied or quantified. In this paper, we introduce CryptoX, an evaluation framework that, for the first time, combines existing benchmarks with cryptographic principles to quantify the compositional reasoning capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a substantial gap between open-source and closed-source LLMs. We further conduct thorough mechanistic interpretability experiments to reveal the inner mechanism of LLMs' compositional reasoning, involving subproblem decomposition, subproblem inference, and summarizing subproblem conclusions. Through analysis based on CryptoBench, we highlight the value of independently studying compositional reasoning and emphasize the need to enhance the compositional reasoning capabilities of LLMs.
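The listing does not spell out how benchmarks and cryptography are combined, but the core idea, wrapping an existing benchmark question in a cipher so the model must compose decoding with solving, can be illustrated with a minimal sketch. The Caesar cipher, the `shift` parameter, and the prompt template below are illustrative assumptions, not the paper's actual construction.

```python
# Illustrative sketch of the CryptoX idea: transform a plain benchmark
# question into a compositional task that requires two coordinated steps:
# (1) decrypt the ciphertext, (2) solve the decrypted question.
# NOTE: the Caesar cipher and prompt wording are assumptions for
# illustration; the paper's actual transformations are not given here.

def caesar_encode(text: str, shift: int = 3) -> str:
    """Shift each ASCII letter forward by `shift`, preserving case."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # digits, spaces, punctuation pass through
    return ''.join(out)

def caesar_decode(text: str, shift: int = 3) -> str:
    """Invert caesar_encode by shifting backwards."""
    return caesar_encode(text, -shift)

def make_compositional_prompt(question: str, shift: int = 3) -> str:
    """Wrap a benchmark question so answering it requires decode + solve."""
    encoded = caesar_encode(question, shift)
    return (
        f"The following question is Caesar-encrypted with shift {shift}.\n"
        f"First decrypt it, then answer the decrypted question.\n\n{encoded}"
    )

if __name__ == "__main__":
    print(make_compositional_prompt("What is 12 + 30?"))
```

Evaluating a model on such prompts, versus the same questions in plaintext, isolates the compositional overhead: any accuracy drop is attributable to having to chain the decoding subproblem with the original reasoning task.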
Problem

Research questions and friction points this paper is trying to address.

Quantifies compositional reasoning in LLMs
Compares open-source and closed-source LLMs
Explores mechanisms behind compositional reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines benchmarks with cryptography
Introduces CryptoBench for evaluation
Focuses on compositional reasoning mechanisms
Jiajun Shi
M-A-P; Beihang University
Chaoren Wei
M-A-P; Beihang University
Liqun Yang
Beihang University
Zekun Moore Wang
KlingAI at Kuaishou Technology
Multimodal, Natural Language Processing, Large Language Models, Generative AI
Chenghao Yang
University of Chicago
Human-AI Alignment, NLP, ML, Communication & Intelligence
Ge Zhang
M-A-P; ByteDance.Inc
Stephen Huang
M-A-P; ByteDance.Inc
Tao Peng
Jilin University
natural language processing, knowledge graph
Jian Yang
Beihang University
Zhoufutu Wen
ByteDance SEED
LLM Evaluation