🤖 AI Summary
Current LLM-driven theorem provers (e.g., for Lean) show clear limitations in human-intuitive compositional reasoning about mathematical inequalities, particularly under variable duplication, algebraic rewriting, and multi-step composition.
Method: To evaluate this capability systematically, the authors introduce Ineq-Comp, a benchmark targeting compositional generalization in inequality reasoning. It applies controlled symbolic transformations (variable duplication, algebraic rewriting, and multi-step composition) to elementary inequalities to generate test instances that remain easy for humans.
Contribution/Results: Experiments reveal substantial performance degradation across state-of-the-art provers (e.g., DeepSeek-Prover-V2-7B suffers a 20% drop in pass@32), persisting even when formal proofs of the sub-propositions are provided in context—indicating a deficit in compositional reasoning rather than missing factual knowledge. The work establishes an evaluation paradigm and benchmark for measuring compositional generalization in formal theorem proving, exposing a persisting gap between current AI provers and human mathematical intuition.
📝 Abstract
LLM-based formal proof assistants (e.g., in Lean) hold great promise for automating mathematical discovery. But beyond syntactic correctness, do these systems truly understand mathematical structure as humans do? We investigate this question through the lens of mathematical inequalities -- a fundamental tool across many domains. While modern provers can solve basic inequalities, we probe their ability to handle human-intuitive compositionality. We introduce Ineq-Comp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers -- including Goedel, STP, and Kimina-7B -- struggle significantly. DeepSeek-Prover-V2-7B shows relative robustness -- possibly because it is trained to decompose the problems into sub-problems -- but still suffers a 20% performance drop (pass@32). Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, revealing that the source of weakness is indeed in compositional reasoning. Our results expose a persisting gap between the generalization behavior of current AI provers and human mathematical intuition.
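To make the transformations concrete, here is an illustrative sketch in Lean 4 with Mathlib (these specific theorems are hypothetical examples in the spirit of the benchmark, not instances drawn from Ineq-Comp itself). A seed inequality is composed with a fresh copy of itself over duplicated variables, yielding a statement that a human proves immediately by adding the two instances:

```lean
import Mathlib

-- Seed inequality: 2ab ≤ a² + b², provable from (a - b)² ≥ 0.
theorem seed (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  nlinarith [sq_nonneg (a - b)]

-- Illustrative composed variant: duplicate the variables (c, d fresh)
-- and sum two instances of the seed. For a human this is just
-- `seed a b` + `seed c d`; the benchmark probes whether provers
-- exploit such structure rather than re-deriving from scratch.
theorem seed_composed (a b c d : ℝ) :
    2 * a * b + 2 * c * d ≤ a ^ 2 + b ^ 2 + c ^ 2 + d ^ 2 := by
  nlinarith [sq_nonneg (a - b), sq_nonneg (c - d)]
```

The "proofs of constituent parts in context" condition from the abstract corresponds to handing the prover `seed` and asking only for `seed_composed`; the paper's finding is that performance stays poor even then.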