EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mathematical reasoning benchmarks suffer from score saturation, temporal decay, and data contamination, hindering sustained evaluation of large language models' (LLMs) reasoning capabilities. To address this, we propose the first evolvable mathematical evaluation framework based on evolutionary testing: it generates algebraically verifiable seed problems via reverse engineering, introduces cognitive challenges through multidimensional genetic operators, and dynamically assesses problem difficulty using a composite fitness function. This framework continuously produces uncontaminated, high-difficulty problems, effectively mitigating model overfitting. Empirical analysis reveals a prevalent "pseudo-insight" phenomenon in LLMs: reliance on heuristic shortcuts yields superficially plausible yet logically flawed reasoning. On evolved GSM8K problems, mainstream models exhibit an average accuracy drop of 48%, clearly exposing and quantifying their non-rigorous reasoning deficiencies.

📝 Abstract
The rapid advancement of LLMs poses a significant challenge to existing mathematical reasoning benchmarks. These benchmarks commonly suffer from issues such as score saturation, temporal decay, and data contamination. To address this challenge, this paper introduces EvolMathEval, an automated mathematical benchmark generation and evolution framework based on evolutionary testing. By dynamically generating unique evaluation instances ab initio, the framework fundamentally eliminates the risk of data contamination and ensures that the benchmark remains perpetually challenging for future models. The core mechanisms of EvolMathEval include: seed problem generation based on reverse engineering with algebraic guarantees; multi-dimensional genetic operators designed to inject diverse cognitive challenges; and a composite fitness function that can rapidly and accurately assess problem difficulty. Experimental results demonstrate that the proposed composite fitness function can efficiently and precisely quantify the difficulty of mathematical problems. Furthermore, EvolMathEval not only generates a large volume of high-difficulty problems through continuous self-iteration, but also significantly enhances the complexity of public datasets like GSM8K through evolution, reducing model accuracy by an average of 48%. Deeper investigation reveals that when solving these evolved, complex problems, LLMs tend to employ non-rigorous heuristics to bypass complex multi-step logical reasoning, consequently arriving at incorrect solutions. We define this phenomenon as the "Pseudo Aha Moment". This finding uncovers a cognitive shortcut-taking behavior in the deep reasoning processes of current LLMs, which we find accounts for 77% to 100% of errors on targeted problems. Code and resources are available at: https://github.com/SYSUSELab/EvolMathEval.
Problem

Research questions and friction points this paper is trying to address.

Addresses score saturation and data contamination in math benchmarks
Introduces automated benchmark generation using evolutionary testing
Reveals LLM heuristic shortcuts in complex reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated benchmark generation via evolutionary testing
Multi-dimensional genetic operators for diverse challenges
Composite fitness function to assess problem difficulty
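The pipeline above (reverse-engineered seeds, genetic operators, fitness-based selection) can be sketched as a toy evolutionary loop. This is an illustrative sketch only, not the paper's implementation: the problem encoding (a linear equation `a*x + b = c` with a pre-chosen solution), the single mutation operator, and the fitness heuristic are all our simplified stand-ins for the framework's richer operators.

```python
import random

random.seed(0)

def generate_seed():
    """Reverse engineering with an algebraic guarantee: pick the answer
    x first, then derive c, so every seed is verifiable by construction."""
    x = random.randint(1, 20)          # ground-truth solution
    a = random.randint(2, 9)
    b = random.randint(1, 50)
    return {"a": a, "b": b, "c": a * x + b, "x": x, "steps": 1}

def mutate(p):
    """One toy 'genetic operator': compose an extra algebraic step while
    preserving the known solution (stands in for the paper's operators)."""
    q = dict(p)
    k = random.randint(2, 5)
    q["a"] *= k
    q["b"] = q["b"] * k + random.randint(1, 10)
    q["c"] = q["a"] * q["x"] + q["b"]  # keep the problem verifiable
    q["steps"] += 1
    return q

def fitness(p):
    """Composite difficulty proxy: more reasoning steps and larger
    coefficients score higher (a placeholder for the real function)."""
    return p["steps"] + 0.01 * (abs(p["a"]) + abs(p["b"]))

def evolve(pop_size=8, generations=5):
    """Evolutionary testing loop: mutate, then keep the hardest problems."""
    pop = [generate_seed() for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(random.choice(pop)) for _ in range(pop_size)]
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return pop

hardest = evolve()
best = hardest[0]
# Every evolved problem remains algebraically verifiable: a*x + b == c.
assert best["a"] * best["x"] + best["b"] == best["c"]
print(f"steps={best['steps']}, fitness={fitness(best):.2f}")
```

Because each problem carries its own ground-truth solution, difficulty can be raised indefinitely without ever losing verifiability, which is the property that lets the benchmark self-iterate.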