🤖 AI Summary
Existing mathematical benchmarks suffer from oversimplified problems, fixed unique answers, susceptibility to memorization or guessing, and narrow coverage, limiting their ability to assess deep constructive mathematical reasoning in large language models (LLMs). To address this, we propose MathConstruct, a novel benchmark comprising 126 competition-level constructive proof problems that require explicitly synthesizing mathematical objects satisfying specified properties. Methodologically, we establish automatically verifiable constructive proofs as the core evaluation paradigm, equip each problem with a lightweight automated verifier, combine human curation with structured annotation, and support automated generation of problem variants for robustness assessment. Experiments reveal that state-of-the-art models solve only 54% of MathConstruct problems, exposing critical deficiencies in goal-directed constructive reasoning. MathConstruct thus establishes a more rigorous, scalable, and automatically verifiable standard for evaluating mathematical reasoning capabilities in LLMs.
📝 Abstract
While Large Language Models (LLMs) demonstrate impressive performance in mathematics, existing math benchmarks come with significant limitations. Many focus on problems with fixed ground-truth answers, and are often saturated due to problem simplicity or the viability of guessing or memorization. Crucially, they capture only a narrow subset of relevant math problems. To address this research gap, we introduce MathConstruct, a new benchmark of 126 challenging problems sourced from various math competitions, which targets constructive proofs, a widely encountered problem type requiring the construction of mathematical objects with specific properties. These proofs are particularly suitable for LLM evaluation, as solution correctness can be easily verified. Our automated verifiers also enable MathConstruct to generate problem variations, which we use to evaluate robustness. State-of-the-art LLMs solve only 54% of MathConstruct problems, highlighting its complexity and importance for LLM evaluation.
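To make the evaluation paradigm concrete, here is a minimal sketch of what an automated verifier for a constructive problem could look like. The problem, the `verify` function, and the parameter `k` are all hypothetical illustrations, not taken from MathConstruct itself: given `k`, construct `k` distinct positive integers whose sum equals their product. A model's answer is accepted only if the checker confirms every stated property.

```python
from math import prod

def verify(candidate: list[int], k: int) -> bool:
    """Check that `candidate` satisfies all properties of the
    (hypothetical) constructive problem instance parametrized by k:
    exactly k distinct positive integers with sum equal to product."""
    return (
        len(candidate) == k
        and len(set(candidate)) == k           # entries are distinct
        and all(x > 0 for x in candidate)      # positive integers
        and sum(candidate) == prod(candidate)  # the target property
    )

# A model-produced construction for k = 3:
print(verify([1, 2, 3], k=3))  # True:  1+2+3 == 1*2*3 == 6
print(verify([1, 2, 4], k=3))  # False: 7 != 8
```

Because the verifier is a function of the problem parameter `k`, varying `k` yields new problem instances checked by the same code, which is the mechanism that allows verifier-equipped benchmarks to generate variations for robustness testing.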