Large Language Models Struggle to Cope with Unreasonability in Math Problems

📅 2024-03-28
📈 Citations: 4
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit insufficient robustness when reasoning over unreasonable mathematical scenarios, such as internally inconsistent premises or false assumptions, yet no benchmark exists to systematically evaluate this capability. Method: We introduce the first Unreasonable Math Problems (UMP) benchmark, comprising manually curated unreasonable mathematical questions of multiple types. It employs a zero-shot evaluation protocol coupled with a behavioral analysis framework to assess how 19 state-of-the-art LLMs identify and respond to such problems. Contribution/Results: We formally define and quantify LLM robustness in unreasonable mathematical reasoning for the first time. Our evaluation reveals widespread failure: even top-tier models such as GPT-4o achieve an accuracy of only 0.6, and reasoning-specialized models (e.g., DeepSeek-R1) show unstable performance, exposing fundamental deficits in logical coherence assessment. This work fills a critical gap in evaluating the mathematical robustness of LLMs and provides a novel diagnostic benchmark to advance logically consistent reasoning in foundation models.

📝 Abstract
Recent research has demonstrated LLMs' impressive performance in math and reasoning. However, the capacity of LLMs to address math problems under unconventional conditions, such as internal inconsistencies and flawed assumptions, remains largely unexplored. In this paper, we propose a novel benchmark, Unreasonable Math Problems (UMP), designed to assess LLMs' ability to recognize and respond to unreasonability in math problems. The benchmark consists of a carefully curated collection of unreasonable math questions across diverse types. Based on extensive experiments covering 19 LLMs, we observe that even state-of-the-art models such as GPT-4o achieve only a limited score of 0.6 on UMP, while reasoning models such as DeepSeek-R1 are prone to overthinking and instability. We further explore strategies for improving the recognition of unreasonable inputs, shedding light on both the possibilities and limitations of LLMs in this challenging setting.
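
To make the evaluation setup concrete, here is a minimal sketch of how a zero-shot UMP-style check could be run. The item set, the query_model callable, and the keyword-based judge are all illustrative assumptions for this sketch; the paper's actual protocol, data, and scoring may differ.

```python
# Minimal sketch of a zero-shot evaluation over unreasonable math problems.
# ump_problems, query_model, and the keyword judge are hypothetical stand-ins,
# not the benchmark's actual data or scoring rule.

from typing import Callable

# Each item pairs an unreasonable question with the flaw it contains.
ump_problems = [
    {
        "question": "A triangle has side lengths 1, 2, and 5. What is its area?",
        "flaw": "violates the triangle inequality",
    },
    {
        "question": "Tom is 3 years older than Jane, and Jane is 5 years "
                    "older than Tom. How old is Tom?",
        "flaw": "internally inconsistent premises",
    },
]

def evaluate_zero_shot(query_model: Callable[[str], str]) -> float:
    """Ask each question with no demonstrations and count how often the
    model flags the problem as unreasonable instead of 'solving' it."""
    hits = 0
    for item in ump_problems:
        answer = query_model(item["question"]).lower()
        # Crude keyword judge for illustration; a real setup would rely on
        # human annotation or an LLM judge.
        if any(k in answer for k in ("impossible", "contradict", "inconsistent")):
            hits += 1
    return hits / len(ump_problems)
```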
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to handle unreasonable math problems
Exploring LLMs' performance under unconventional math conditions
Improving recognition of flawed assumptions in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed the Unreasonable Math Problems (UMP) benchmark
Evaluated 19 LLMs on the UMP benchmark
Explored strategies for recognizing unreasonable inputs (see the sketch below)
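
As one concrete illustration of such a strategy, the following sketch wraps each question in an instruction that asks the model to audit the premises before answering. The template and wrapper function are assumptions for illustration, not the paper's exact prompt.

```python
# Hypothetical prompt wrapper illustrating a premise-checking strategy.
# CHECK_FIRST_TEMPLATE and with_premise_check are illustrative names,
# not the paper's actual prompt or code.

CHECK_FIRST_TEMPLATE = (
    "Before solving, check whether the problem's premises are consistent "
    "and its assumptions are valid. If the problem is unreasonable, say so "
    "and explain the flaw instead of producing a numeric answer.\n\n"
    "Problem: {question}"
)

def with_premise_check(question: str) -> str:
    """Wrap a raw question in the premise-checking instruction."""
    return CHECK_FIRST_TEMPLATE.format(question=question)
```

Feeding with_premise_check(question) instead of the raw question to the same model then allows a direct comparison of recognition rates with and without the instruction.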
Jingyuan Ma
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Damai Dai
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Zhifang Sui
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University