Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers

📅 2025-06-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit robustness deficiencies even on seemingly simple tasks such as mathematical reasoning, and existing evaluation methods rely on hand-crafted templates or fixed perturbation rules, leaving them vulnerable to data contamination. Method: The paper proposes AR-Checker, an automated stress-testing framework designed to assess the robustness of LLMs in mathematical reasoning. Its core is multi-round, parallel LLM-driven rewriting combined with semantic-consistency verification, which dynamically generates problem variants that are semantically equivalent to the original yet induce failures. Contribution/Results: AR-Checker enables contamination-free, customizable benchmark generation, overcoming the limitations of manual, rule-based approaches; rewrites are constrained to preserve the original semantics, and the framework extends to domains beyond mathematics. Experiments show that AR-Checker exposes substantial model vulnerabilities on GSM8K and MATH-500, and its generalizability is validated on non-mathematical benchmarks, including MMLU, MMLU-Pro, and CommonsenseQA, broadening both the coverage and the credibility of robustness evaluation.

📝 Abstract
Large language models (LLMs) have achieved strong performance on various reasoning-intensive tasks. However, LLMs still suffer from robustness issues and can fail unexpectedly on simple reasoning tasks. Previous works evaluate LLM robustness with hand-crafted templates or a limited set of perturbation rules, which are static and therefore susceptible to data contamination through pre-training or fine-tuning datasets. In this work, inspired by stress testing in software engineering, we propose a novel framework, Automatic Robustness Checker (AR-Checker), to generate mathematical problem variants that preserve the semantics of the original problem but may cause the LLM to fail. The AR-Checker framework generates these variants through multi-round, parallel streams of LLM-based rewriting and verification. Because the benchmark variants are generated dynamically for each LLM, the risk of data contamination is minimized. Experiments on GSM8K and MATH-500 demonstrate the strong performance of AR-Checker on mathematical tasks. We also evaluate AR-Checker on benchmarks beyond mathematics, including MMLU, MMLU-Pro, and CommonsenseQA, where it also performs strongly, further demonstrating its effectiveness.
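To make the pipeline concrete, below is a minimal Python sketch of the rewrite-and-verify loop the abstract describes. The helper callables (`rewriter`, `verifier`, `target`), the round and stream counts, and the exact-match answer check are illustrative assumptions rather than the paper's actual implementation, and the parallel streams are shown as a sequential loop for brevity.

```python
from typing import Callable, Optional

def find_failing_variant(
    problem: str,
    gold_answer: str,
    rewriter: Callable[[str], str],        # LLM call that paraphrases a problem
    verifier: Callable[[str, str], bool],  # LLM judge: are the two problems semantically equivalent?
    target: Callable[[str], str],          # LLM under test, returns its final answer
    n_rounds: int = 3,
    n_streams: int = 4,
) -> Optional[str]:
    """Search for a semantically equivalent variant that the target LLM answers incorrectly."""
    candidates = [problem]
    for _ in range(n_rounds):
        next_candidates = []
        for base in candidates:
            for _ in range(n_streams):  # parallel rewriting streams, shown sequentially here
                variant = rewriter(base)
                # Discard rewrites the verifier judges not equivalent to the original problem.
                if not verifier(problem, variant):
                    continue
                # A variant "succeeds" for the checker when the target model fails on it.
                if target(variant).strip() != gold_answer.strip():
                    return variant
                next_candidates.append(variant)
        candidates = next_candidates or candidates
    return None  # no failure-inducing variant found within the budget
```

A variant is kept only if the verifier judges it semantically equivalent to the original; the search stops as soon as the target model gets one such variant wrong.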
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM robustness in mathematical problem-solving tasks
Detecting unexpected failures in simple reasoning tasks
Minimizing data contamination risks in LLM evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

AR-Checker framework for LLM robustness testing
Multi-round parallel LLM rewriting and verification
Dynamic benchmark generation to prevent data contamination (see the sketch below)
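As a hedged illustration of how dynamic, per-model benchmark generation could be wired into an evaluation loop, the sketch below reuses the hypothetical `find_failing_variant` helper from the earlier block; the dataset format and the accuracy-drop metric are assumptions, not the paper's reported protocol.

```python
from typing import Callable, Sequence, Tuple

def stress_test(
    dataset: Sequence[Tuple[str, str]],   # (problem, gold_answer) pairs, e.g. from GSM8K
    rewriter: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    target: Callable[[str], str],
) -> dict:
    """Compare target accuracy on original problems vs. their stressed variants."""
    original_correct = stressed_correct = 0
    for problem, gold in dataset:
        original_correct += target(problem).strip() == gold.strip()
        variant = find_failing_variant(problem, gold, rewriter, verifier, target)
        # If no equivalent failing variant is found within the budget, the model is
        # evaluated on the original problem for the stressed score as well.
        stressed_correct += target(variant or problem).strip() == gold.strip()
    n = len(dataset)
    return {
        "original_accuracy": original_correct / n,
        "stressed_accuracy": stressed_correct / n,  # the gap quantifies the robustness drop
    }
```

Because the variants are searched per target model at evaluation time, the stressed set is model-specific and cannot have appeared verbatim in that model's training data.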
Authors
Yutao Hou — Shanghai University of Finance and Economics
Zeguan Xiao — Shanghai University of Finance and Economics
Fei Yu — Ant Group
Yihan Jiang — Amazon AGI (LLM, LLM agent, Federated Learning)
Xuetao Wei — Associate Professor, Southern University of Science and Technology (AI Ethics, AI Safety)
Hailiang Huang — Shanghai University of Finance and Economics
Yun Chen — Shanghai University of Finance and Economics
Guanhua Chen — Southern University of Science and Technology