Machine Learning meets Algebraic Combinatorics: A Suite of Datasets Capturing Research-level Conjecturing Ability in Pure Mathematics

📅 2025-03-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing machine learning datasets for mathematics mostly target the high-school, undergraduate, and graduate levels; few match the difficulty and open-endedness of the problems that professional mathematicians work on. Method: The authors introduce the Algebraic Combinatorics Dataset Repository (ACD Repo), a collection of nine datasets, each representing either a foundational result or an open problem in algebraic combinatorics. Each dataset pairs an open-ended research-level question with a large set of examples (up to 10M in some cases) from which conjectures are to be generated. The paper outlines two ways to apply machine learning to the datasets: training narrow models and then analyzing them with interpretability methods, and program synthesis with large language models. Contribution/Results: The paper describes all nine datasets and discusses the challenges involved in designing datasets aimed at the conjecturing process.

📝 Abstract
With recent dramatic increases in AI system capabilities, there has been growing interest in utilizing machine learning for reasoning-heavy, quantitative tasks, particularly mathematics. While there are many resources capturing mathematics at the high-school, undergraduate, and graduate levels, there are far fewer resources that align with the level of difficulty and open-endedness encountered by professional mathematicians working on open problems. To address this, we introduce a new collection of datasets, the Algebraic Combinatorics Dataset Repository (ACD Repo), representing either foundational results or open problems in algebraic combinatorics, a subfield of mathematics that studies discrete structures arising from abstract algebra. Further differentiating our dataset collection is the fact that it aims at the conjecturing process. Each dataset includes an open-ended research-level question and a large collection of examples (up to 10M in some cases) from which conjectures should be generated. We describe all nine datasets and the different ways machine learning models can be applied to them (e.g., training with narrow models followed by interpretability analysis, or program synthesis with LLMs), and we discuss some of the challenges involved in designing datasets like these.
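
To make the "question plus examples" setup concrete, here is a minimal Python sketch of a conjecturing loop in the program-synthesis style. It is a toy illustration only: the property used (avoiding the permutation pattern 132) is a standard textbook example and is not claimed to be one of the ACD Repo datasets, and the names `build_examples`, `accuracy`, `candidate_a`, and `candidate_b` are hypothetical. The two candidates stand in for programs that a person or an LLM might propose, each scored against the labeled examples.

```python
from itertools import combinations, permutations


def avoids_132(perm):
    """Ground-truth label: True if no positions i < j < k have
    perm[i] < perm[k] < perm[j] (the pattern 132)."""
    # combinations() keeps positional order, so (a, b, c) are entries
    # at positions i < j < k.
    return not any(a < c < b for a, b, c in combinations(perm, 3))


def build_examples(n):
    """Toy dataset (assumed format): all permutations of 1..n with a label."""
    return [(p, avoids_132(p)) for p in permutations(range(1, n + 1))]


def accuracy(candidate, examples):
    """Fraction of examples on which a candidate program matches the label."""
    return sum(candidate(p) == y for p, y in examples) / len(examples)


def candidate_a(perm):
    """A deliberately naive guess: first entry larger than last entry."""
    return perm[0] > perm[-1]


def candidate_b(perm):
    """A structural guess: for every position j, each later entry smaller
    than perm[j] must also be smaller than every earlier entry."""
    for j in range(1, len(perm) - 1):
        smaller_right = [c for c in perm[j + 1:] if c < perm[j]]
        if smaller_right and min(perm[:j]) < max(smaller_right):
            return False
    return True


if __name__ == "__main__":
    examples = build_examples(5)
    for name, cand in [("candidate_a", candidate_a), ("candidate_b", candidate_b)]:
        print(name, accuracy(cand, examples))
```

A candidate that matches every example is still only a conjecture: agreement on finitely many examples is evidence, not proof, which is why the datasets are framed around conjecture generation rather than theorem proving.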
Problem

Research questions and friction points this paper is trying to address.

Develop datasets for research-level conjecturing in pure mathematics
Address the lack of resources that match the difficulty and open-endedness of problems faced by professional mathematicians
Focus on algebraic combinatorics with open-ended questions and examples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset collection for algebraic combinatorics research
Open-ended questions with large example sets
Machine learning models for conjecture generation
Authors

Herman Chau: University of Washington
Helen Jenne: Pacific Northwest National Laboratory
Davis Brown: University of Pennsylvania (deep learning)
Jesse He: University of California, San Diego; Pacific Northwest National Laboratory
Mark Raugas: Pacific Northwest National Laboratory
Sara Billey: University of Washington (algebraic combinatorics)
Henry Kvinge: Pacific Northwest National Lab / University of Washington (representation learning, adversarial machine learning, geometric deep learning, representation theory, combinatorics)