🤖 AI Summary
Existing mathematical benchmarks suffer from oversimplified problems, fixed unique answers, susceptibility to memorization or guessing, and narrow coverage, limiting their ability to assess deep constructive mathematical reasoning in large language models (LLMs). To address this, we propose MathConstruct, a novel benchmark comprising 126 competition-level constructive proof problems that require explicitly synthesizing mathematical objects satisfying specified properties. Methodologically, we establish automatically verifiable constructive proofs as the core evaluation paradigm, equip each problem with a lightweight automated verifier, combine human curation with structured annotation, and support automated generation of problem variants for robustness assessment. Experiments reveal that state-of-the-art models solve only 54% of MathConstruct problems, exposing critical deficiencies in goal-directed constructive reasoning. MathConstruct thus establishes a more rigorous, scalable, and automatically verifiable standard for evaluating mathematical reasoning capabilities in LLMs.
📝 Abstract
While Large Language Models (LLMs) demonstrate impressive performance in mathematics, existing math benchmarks come with significant limitations. Many focus on problems with fixed ground-truth answers, and are often saturated due to problem simplicity or the viability of guessing or memorization. Crucially, they capture only a narrow subset of relevant math problems. To address this research gap, we introduce MathConstruct, a new benchmark of 126 challenging problems sourced from various math competitions, which targets constructive proofs, a widely encountered problem type requiring the construction of mathematical objects with specific properties. These proofs are particularly suitable for LLM evaluation, as solution correctness can be easily verified. Our automated verifiers also enable MathConstruct to generate problem variations, which we use to evaluate robustness. State-of-the-art LLMs solve only 54% of MathConstruct problems, highlighting its complexity and importance for LLM evaluation.
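To make the evaluation paradigm concrete, here is a minimal sketch of what an automated verifier for a constructive problem could look like. The problem, the `verify` function, and the parameter `k` are all hypothetical illustrations, not taken from MathConstruct itself: given `k`, construct `k` distinct positive integers whose sum equals their product. A model's answer is accepted only if the checker confirms every stated property.

```python
from math import prod

def verify(candidate: list[int], k: int) -> bool:
    """Check that `candidate` satisfies all properties of the
    (hypothetical) constructive problem instance parametrized by k:
    exactly k distinct positive integers with sum equal to product."""
    return (
        len(candidate) == k
        and len(set(candidate)) == k           # entries are distinct
        and all(x > 0 for x in candidate)      # positive integers
        and sum(candidate) == prod(candidate)  # the target property
    )

# A model-produced construction for k = 3:
print(verify([1, 2, 3], k=3))  # True:  1+2+3 == 1*2*3 == 6
print(verify([1, 2, 4], k=3))  # False: 7 != 8
```

Because the verifier is a function of the problem parameter `k`, varying `k` yields new problem instances checked by the same code, which is the mechanism that allows verifier-equipped benchmarks to generate variations for robustness testing.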