HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM mathematical benchmarks emphasize exact solutions or formal proofs, overlooking pervasive approximate modeling tasks in applied sciences. Method: We introduce HARDMath2—a high-quality, expert-curated benchmark focused on asymptotic analysis and applied mathematics—comprising 211 original problems spanning boundary layer theory, the WKB method, and asymptotic solutions to nonlinear PDEs. Developed collaboratively by Harvard faculty and students, it employs a novel “student-led, human–model interactive” construction paradigm: difficult problems are reverse-engineered from LLM failure cases and refined via human authoring, peer verification, automated LLM solving, numerical solution validation, and asymptotic modeling verification. Contribution/Results: State-of-the-art LLMs perform poorly on HARDMath2, revealing critical gaps in asymptotic reasoning. Notably, students deepened their own mathematical understanding by diagnosing model errors. HARDMath2 fills a fundamental gap in evaluating LLMs’ applied mathematical competence and establishes a new paradigm for rigorous, pedagogically informed reasoning assessment.

📝 Abstract
Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems that have exact analytical solutions or involve formal proofs, often overlooking approximation-based problems ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present HARDMath2, a dataset of 211 original problems covering the core topics in an introductory graduate applied math class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. This dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent with the class syllabus, peer-validate solutions, test different models, and automatically check LLM-generated solutions against their own answers and numerical ground truths. Evaluation results show that leading frontier models still struggle with many of the problems in the dataset, highlighting a gap in the mathematical reasoning skills of current LLMs. Importantly, students identified strategies to create increasingly difficult problems by interacting with the models and exploiting common failure modes. This back-and-forth with the models not only resulted in a richer and more challenging benchmark but also led to qualitative improvements in the students' understanding of the course material, which is increasingly important as we enter an age where state-of-the-art language models can solve many challenging problems across a wide range of fields.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on approximation-based applied math problems
Creating a collaborative dataset for graduate applied math
Identifying LLM failure modes in mathematical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative student-instructor problem design
Peer-validated solutions and model testing
Automated LLM solution verification system
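The automated-verification idea described above can be sketched in miniature: compare a candidate asymptotic answer against a numerical ground truth over the regime where the expansion should hold. This is a hypothetical, minimal illustration, not the paper's actual pipeline; the test integral, tolerance, and function names are all illustrative. Here the candidate is the two-term large-x expansion of I(x) = ∫₀^∞ e^(−xt)/(1+t) dt, checked against adaptive quadrature.

```python
import numpy as np
from scipy.integrate import quad

def numerical_ground_truth(x):
    # I(x) = ∫_0^∞ e^{-x t} / (1 + t) dt, evaluated by adaptive quadrature
    val, _ = quad(lambda t: np.exp(-x * t) / (1 + t), 0, np.inf)
    return val

def candidate_asymptotic(x):
    # Two-term Watson's-lemma expansion: I(x) ~ 1/x - 1/x^2 as x -> ∞
    return 1 / x - 1 / x**2

def validate(candidate, ground_truth, x_values, rtol=1e-2):
    # Accept the candidate only if its relative error stays below rtol
    # at every sampled point in the asymptotic regime.
    for x in x_values:
        exact = ground_truth(x)
        if abs(candidate(x) - exact) / abs(exact) > rtol:
            return False
    return True

# Passes in the large-x regime, fails when x is too small for the expansion.
print(validate(candidate_asymptotic, numerical_ground_truth, [20, 50, 100]))
print(validate(candidate_asymptotic, numerical_ground_truth, [2]))
```

The same structure generalizes: each problem ships with a numerical solver (quadrature, ODE integration, root finding) as ground truth, and an LLM's symbolic answer is accepted only if it tracks that ground truth within tolerance in the stated asymptotic limit.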
👥 Authors
James V. Roggeveen (School of Engineering and Applied Sciences, Harvard University)
Erik Y. Wang (University of Oxford; Harvard University)
Will Flintoft (School of Engineering and Applied Sciences, Harvard University)
Peter Donets (School of Engineering and Applied Sciences, Harvard University)
Lucy S. Nathwani (School of Engineering and Applied Sciences, Harvard University)
Nickholas Gutierrez (School of Engineering and Applied Sciences, Harvard University)
David Ettel (School of Engineering and Applied Sciences, Harvard University)
Anton Marius Graf (School of Engineering and Applied Sciences, Harvard University)
Siddharth Dandavate (School of Engineering and Applied Sciences, Harvard University)
Arjun Nageswaran (School of Engineering and Applied Sciences, Harvard University)
Raglan Ward (School of Engineering and Applied Sciences, Harvard University)
Ava Williamson (School of Engineering and Applied Sciences, Harvard University)
Anne Mykland (School of Engineering and Applied Sciences, Harvard University)
Kacper K. Migacz (School of Engineering and Applied Sciences, Harvard University)
Yijun Wang (School of Engineering and Applied Sciences, Harvard University)
Egemen Bostan (School of Engineering and Applied Sciences, Harvard University)
Duy Thuc Nguyen (School of Engineering and Applied Sciences, Harvard University)
Zhe He (University of Macau)
Marc L. Descoteaux (School of Engineering and Applied Sciences, Harvard University)
Felix Yeung (School of Engineering and Applied Sciences, Harvard University)
Shida Liu (School of Engineering and Applied Sciences, Harvard University)
Jorge García Ponce (School of Engineering and Applied Sciences, Harvard University)
Luke Zhu (School of Engineering and Applied Sciences, Harvard University)
Yuyang Chen (School of Engineering and Applied Sciences, Harvard University)
Ekaterina S. Ivshina (School of Engineering and Applied Sciences, Harvard University)
Miguel Fernandez (School of Engineering and Applied Sciences, Harvard University)
Minjae Kim (School of Engineering and Applied Sciences, Harvard University)
Kennan Gumbs (School of Engineering and Applied Sciences, Harvard University)
Matthew Scott Tan (School of Engineering and Applied Sciences, Harvard University)
Russell Yang (School of Engineering and Applied Sciences, Harvard University)
Mai Hoang (School of Engineering and Applied Sciences, Harvard University)
David Brown (School of Engineering and Applied Sciences, Harvard University)
Isabella A. Silveira (School of Engineering and Applied Sciences, Harvard University)
Lavon Sykes (School of Engineering and Applied Sciences, Harvard University)
Ahmed Roman (Dana-Farber; Broad Institute; Harvard Medical School)
William Fredenberg (School of Engineering and Applied Sciences, Harvard University)
Yiming Chen (School of Engineering and Applied Sciences, Harvard University)
Lucas Martin (School of Engineering and Applied Sciences, Harvard University)
Yixing Tang (School of Engineering and Applied Sciences, Harvard University)
Kelly Werker Smith (School of Engineering and Applied Sciences, Harvard University)
Hongyu Liao (School of Engineering and Applied Sciences, Harvard University)
Logan G. Wilson (School of Engineering and Applied Sciences, Harvard University)
Alexander Dazhen Cai (School of Engineering and Applied Sciences, Harvard University)
Andrea Elizabeth Biju (School of Engineering and Applied Sciences, Harvard University)
Michael P. Brenner (Harvard University)