🤖 AI Summary
Large language models (LLMs) exhibit insufficient robustness when reasoning over irrational mathematical scenarios—such as internally inconsistent premises or false assumptions—yet no benchmark exists to systematically evaluate this capability.
Method: We introduce the first Unreasonable Math Problems (UMP) benchmark, comprising manually curated, multi-type irrational mathematical questions. It employs a zero-shot evaluation protocol coupled with a behavioral analysis framework to assess how 19 state-of-the-art LLMs identify and respond to such problems.
Contribution/Results: We formally define and quantify LLM robustness in irrational mathematical reasoning for the first time. Our evaluation reveals widespread failure: even top-tier models such as GPT-4o achieve an accuracy of only 0.6, while reasoning-specialized models (e.g., DeepSeek-R1) show unstable performance, exposing fundamental deficits in assessing logical coherence. This work fills a critical gap in evaluating the mathematical robustness of LLMs and provides a novel diagnostic benchmark to advance logically consistent reasoning in foundation models.
📝 Abstract
Recent research has demonstrated LLMs' impressive performance in math and reasoning. However, the capacity of LLMs to address math problems under unconventional conditions, such as internal inconsistencies and flawed assumptions, remains largely unexplored. In this paper, we propose a novel benchmark, Unreasonable Math Problems (UMP), designed to assess LLMs' ability to recognize and respond to unreasonableness in math problems. The benchmark consists of a carefully curated collection of unreasonable math questions across diverse types. Based on extensive experiments covering 19 LLMs, we observe that even state-of-the-art models such as GPT-4o achieve only a limited score of 0.6 on UMP, while reasoning models such as DeepSeek-R1 are prone to overthinking and instability. We further explore strategies for improving the recognition of unreasonable inputs, shedding light on both the potential and limitations of LLMs in this challenging setting.