🤖 AI Summary
Large language models (LLMs) exhibit insufficient robustness when reasoning over irrational mathematical scenarios—such as internally inconsistent premises or false assumptions—yet no benchmark exists to systematically evaluate this capability.
Method: We introduce the first Unreasonable Math Problems (UMP) benchmark, comprising manually curated, multi-type irrational mathematical questions. It employs a zero-shot evaluation protocol coupled with a behavioral analysis framework to assess how 19 state-of-the-art LLMs identify and respond to such problems.
Contribution/Results: We formally define and quantify LLM robustness in irrational mathematical reasoning for the first time. Our evaluation reveals widespread failure: even top-tier models such as GPT-4o achieve an accuracy of only 0.6, while reasoning-specialized models (e.g., DeepSeek-R1) show unstable performance, exposing fundamental deficits in assessing logical coherence. This work fills a critical gap in evaluating the mathematical robustness of LLMs and provides a novel diagnostic benchmark to advance logically consistent reasoning in foundation models.
📝 Abstract
Recent research has demonstrated LLMs' impressive performance in math and reasoning. However, the capacity of LLMs to address math problems under unconventional conditions, such as internal inconsistencies and flawed assumptions, remains largely unexplored. In this paper, we propose a novel benchmark, Unreasonable Math Problems (UMP), designed to assess LLMs' ability to recognize and respond to unreasonableness in math problems. The benchmark consists of a carefully curated collection of unreasonable math questions across diverse types. Based on extensive experiments covering 19 LLMs, we observe that even state-of-the-art models such as GPT-4o achieve only a limited score of 0.6 on UMP, while reasoning models such as DeepSeek-R1 are prone to overthinking and instability. We further explore strategies for improving the recognition of unreasonable inputs, shedding light on both the potential and limitations of LLMs in this challenging setting.