Benchmarking Large Language Models via Random Variables

📅 2025-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mathematical reasoning benchmarks suffer from oversimplified problems and information leakage, and thus fail to reliably assess large language models' (LLMs) logical comprehension. To address this, we propose RV-Bench, the first benchmark to introduce a random-variable perturbation paradigm: by stochastically reparameterizing the variables in mathematical problems, we construct over 900 generalization-oriented test instances that evaluate the robustness of reasoning logic rather than static answer matching. Our methodology integrates templated problem modeling, cross-model consistency analysis, and assessment of accuracy decay under increasing perturbation intensity. A comprehensive evaluation of 29 state-of-the-art LLMs reveals substantial performance degradation on complex reasoning tasks, highlighting critical gaps in current capabilities. RV-Bench introduces the first leaderboard explicitly focused on core reasoning robustness, offering a more reliable, discriminative, and theoretically grounded standard for mathematical reasoning evaluation.

📝 Abstract
With the continuous advancement of large language models (LLMs) in mathematical reasoning, evaluating their performance in this domain has become a prominent research focus. Recent studies have raised concerns about the reliability of current mathematical benchmarks, highlighting issues such as simplistic design and potential data leakage. Creating a reliable benchmark that effectively evaluates the genuine mathematical reasoning capabilities of LLMs therefore remains a significant challenge. To address this, we propose RV-Bench, a framework for Benchmarking LLMs via Random Variables in mathematical reasoning. Specifically, the background content of a random variable question (RV question) mirrors an original problem from an existing standard benchmark, but the variable combinations are randomized to different values. LLMs must fully understand the problem-solving process of the original problem to correctly answer RV questions across various combinations of variable values. As a result, an LLM's genuine capability in mathematical reasoning is reflected by its accuracy on RV-Bench. Extensive experiments are conducted with 29 representative LLMs across 900+ RV questions, and a leaderboard for RV-Bench ranks their genuine capability. Further analysis of the accuracy drop indicates that current LLMs still struggle with complex mathematical reasoning problems.
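The construction described above can be sketched in a few lines. This is an illustrative Python sketch, not the paper's released code: the `make_rv_question` helper, the template, and the solver are all hypothetical, showing only the general idea of keeping a problem's background text fixed while randomizing its variable values and recomputing the ground-truth answer.

```python
import random

def make_rv_question(template, var_ranges, solver, seed=None):
    """Instantiate one RV question from a templated problem.

    template   -- problem text with {name} placeholders
    var_ranges -- dict mapping each variable name to candidate values
    solver     -- function computing the ground-truth answer from the values
    """
    rng = random.Random(seed)
    # Sample one value per variable, then fill the background text.
    values = {name: rng.choice(cands) for name, cands in var_ranges.items()}
    question = template.format(**values)
    # Recompute the correct answer for this particular value combination.
    answer = solver(**values)
    return question, answer

# Example: a grade-school-style word problem with two randomized variables.
template = ("A baker makes {a} loaves each day and sells {b} of them. "
            "How many loaves are left after 7 days?")
var_ranges = {"a": [10, 12, 15, 20], "b": [3, 5, 7]}
question, answer = make_rv_question(
    template, var_ranges, solver=lambda a, b: 7 * (a - b), seed=0,
)
```

A model that merely memorized the original problem's answer will fail on most sampled variants, while a model that understands the solution procedure answers correctly for any value combination.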
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Mathematical Reasoning
Evaluation Methodology
Innovation

Methods, ideas, or system contributions that make the work stand out.

RV-Bench
mathematical reasoning
random variables
Authors

Zijin Hong (The Hong Kong Polytechnic University): Text-to-SQL, Large Language Models, Natural Language Processing
Hao Wu (The Hong Kong Polytechnic University; University of Electronic Science and Technology of China)
Su Dong (Ant Group): NLP
Junnan Dong (Tencent Youtu Lab | HKPolyU): Large Language Models, GraphRAG, Agent, Knowledge Graphs
Yilin Xiao (The Hong Kong Polytechnic University)
Yujing Zhang (The Hong Kong Polytechnic University)
Zhu Wang (The Hong Kong Polytechnic University)
Feiran Huang (Professor, Jinan University): Recommender Systems, Text-to-SQL, Sentiment Analysis, LLMs, Multimodal Learning
Linyi Li (Simon Fraser University)
Hongxia Yang (Professor, HK Polytechnic University): Machine Learning, Generative AI, Cognitive Intelligence, Statistical Modeling
Xiao Huang (The Hong Kong Polytechnic University)