🤖 AI Summary
Existing mathematical evaluation benchmarks rely on manual construction, limiting scalability and incurring high costs—particularly for proof-based problems—thereby hindering deep assessment of large language models’ (LLMs’) mathematical reasoning capabilities.
Method: We propose Proof2Hybrid, the first fully automated framework for synthesizing high-quality, proof-centric mathematical evaluations. Its core is the Proof2X roadmap, which systematically transforms natural-language mathematical proofs into multiple easily verifiable question formats, introducing the novel "m-out-of-n multiple judge question" hybrid format (sets of true/false sub-claims scored jointly) to improve robustness against guessing and superficial pattern matching. The framework integrates rule-guided NLP transformation with formal verification logic to realize an end-to-end automated pipeline.
Contribution/Results: Applying Proof2Hybrid, we construct AlgGeoTest—a 456-item algebraic geometry benchmark—demonstrating that state-of-the-art LLMs exhibit fundamental conceptual deficiencies in this domain, thereby revealing critical boundaries of their mathematical understanding.
📝 Abstract
Evaluating the mathematical capability of Large Language Models (LLMs) is a critical yet challenging frontier. Existing benchmarks fall short, particularly for proof-centric problems, as manual creation is unscalable and costly, leaving the true mathematical abilities of LLMs largely unassessed. To overcome these barriers, we propose Proof2Hybrid, the first fully automated framework that synthesizes high-quality, proof-centric benchmarks from natural-language mathematical corpora. The key novelty of our solution is Proof2X, a roadmap for converting mathematical proofs into various kinds of questions that are easy to verify. Guided by this roadmap, we propose a new type of hybrid-formatted question, named "m-out-of-n multiple judge questions", specifically designed to enable robust, automatic evaluation while being resilient to the guessing and superficial pattern matching inherent in traditional formats. As a demonstration of our framework, we introduce AlgGeoTest, a benchmark for algebraic geometry, a frontier domain of modern mathematics, comprising 456 challenging items. Our extensive evaluations of state-of-the-art LLMs on AlgGeoTest reveal profound deficits in their comprehension of algebraic geometry, providing a more precise measure of their true mathematical capabilities. Our framework and benchmark pave the way for a new wave of in-depth research into the mathematical intelligence of AI systems.
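To make the "m-out-of-n multiple judge question" format concrete, here is a minimal sketch of how such an item might be represented and scored. The abstract does not specify the scoring rubric, so the all-or-nothing rule, the `MultiJudgeItem` structure, and the example statements below are illustrative assumptions, not the paper's actual implementation; the key idea is that a model must judge all n sub-claims (of which exactly m are true) jointly, so blind guessing succeeds with probability only 2^-n.

```python
# Hypothetical sketch of an "m-out-of-n multiple judge" item and scorer.
# The exact scoring rule is NOT specified in the abstract; we assume
# all-or-nothing credit here, which is what makes the format resistant
# to guessing and superficial pattern matching.

from dataclasses import dataclass
from typing import List

@dataclass
class MultiJudgeItem:
    statements: List[str]   # n candidate claims derived from one proof
    truth: List[bool]       # ground truth; exactly m entries are True

def score(item: MultiJudgeItem, answers: List[bool]) -> float:
    """All-or-nothing credit: 1.0 only if every judgment matches."""
    assert len(answers) == len(item.truth)
    return 1.0 if answers == item.truth else 0.0

# Example item with n = 4 statements, m = 2 true.
# The statements are placeholder algebraic-geometry claims, not drawn
# from AlgGeoTest itself.
item = MultiJudgeItem(
    statements=[
        "Every smooth projective curve of genus 0 over an algebraically "
        "closed field is isomorphic to the projective line.",
        "Every affine variety of positive dimension is complete.",
        "Every regular local ring is a unique factorization domain.",
        "Every projective morphism is affine.",
    ],
    truth=[True, False, True, False],
)

print(score(item, [True, False, True, False]))  # all correct -> 1.0
print(score(item, [True, True, True, False]))   # one wrong  -> 0.0
```

Because credit is only awarded when all n judgments are simultaneously correct, a model cannot earn points by exploiting surface patterns in a single statement, which is the robustness property the hybrid format is designed to provide.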