🤖 AI Summary
Current LLM-driven theorem provers (e.g., for Lean) show clear limitations in human-intuitive compositional reasoning about mathematical inequalities, particularly under variable duplication, algebraic rewriting, and multi-step composition.
Method: To evaluate this capability systematically, the authors introduce Ineq-Comp, a benchmark targeting compositional generalization in inequality reasoning. It applies controlled symbolic transformations (variable duplication, algebraic rewriting, and multi-step composition) to elementary inequalities to generate test instances that remain easy for humans.
Contribution/Results: Experiments reveal substantial performance degradation across state-of-the-art provers (e.g., DeepSeek-Prover-V2-7B suffers a 20% drop in pass@32), persisting even when formal proofs of the sub-propositions are provided in context—indicating a deficit in compositional reasoning rather than missing factual knowledge. The work establishes an evaluation paradigm and benchmark for measuring compositional generalization in formal theorem proving, exposing a persisting gap between current AI provers and human mathematical intuition.
📝 Abstract
LLM-based formal proof assistants (e.g., in Lean) hold great promise for automating mathematical discovery. But beyond syntactic correctness, do these systems truly understand mathematical structure as humans do? We investigate this question through the lens of mathematical inequalities -- a fundamental tool across many domains. While modern provers can solve basic inequalities, we probe their ability to handle human-intuitive compositionality. We introduce Ineq-Comp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers -- including Goedel, STP, and Kimina-7B -- struggle significantly. DeepSeek-Prover-V2-7B shows relative robustness -- possibly because it is trained to decompose the problems into sub-problems -- but still suffers a 20% performance drop (pass@32). Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, revealing that the source of weakness is indeed in compositional reasoning. Our results expose a persisting gap between the generalization behavior of current AI provers and human mathematical intuition.
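To make the transformations concrete, here is an illustrative sketch in Lean 4 with Mathlib (these specific theorems are hypothetical examples in the spirit of the benchmark, not instances drawn from Ineq-Comp itself). A seed inequality is composed with a fresh copy of itself over duplicated variables, yielding a statement that a human proves immediately by adding the two instances:

```lean
import Mathlib

-- Seed inequality: 2ab ≤ a² + b², provable from (a - b)² ≥ 0.
theorem seed (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  nlinarith [sq_nonneg (a - b)]

-- Illustrative composed variant: duplicate the variables (c, d fresh)
-- and sum two instances of the seed. For a human this is just
-- `seed a b` + `seed c d`; the benchmark probes whether provers
-- exploit such structure rather than re-deriving from scratch.
theorem seed_composed (a b c d : ℝ) :
    2 * a * b + 2 * c * d ≤ a ^ 2 + b ^ 2 + c ^ 2 + d ^ 2 := by
  nlinarith [sq_nonneg (a - b), sq_nonneg (c - d)]
```

The "proofs of constituent parts in context" condition from the abstract corresponds to handing the prover `seed` and asking only for `seed_composed`; the paper's finding is that performance stays poor even then.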