🤖 AI Summary
This study investigates whether large language models (LLMs) align with humans in judging the “interestingness” and “difficulty” of mathematical problems, particularly across groups with divergent mathematical expertise—crowdworkers versus International Mathematical Olympiad (IMO) participants.
Method: We conduct the first systematic empirical comparison between human subjective assessments and outputs from multiple LLMs, quantifying distributional alignment and correlating model-generated rationales with human-elicited justifications.
Contribution/Results: While LLMs exhibit coarse discrimination between interesting and uninteresting problems, they fail to replicate human judgment distributions; their generated explanations for interestingness show significantly weak correlation with human rationales. Crucially, LLMs also fail to capture the systematic judgment divergence between experts and non-experts. These findings expose fundamental limitations of current LLMs in modeling mathematical cognition, establish critical boundaries for deploying AI as an educational thinking partner, and introduce the first benchmark framework for human–model alignment on mathematical interestingness.
📝 Abstract
The evolution of mathematics has been guided in part by interestingness. From researchers choosing which problems to tackle next, to students deciding which ones to engage with, people's choices are often guided by judgments about how interesting or challenging problems are likely to be. As AI systems, such as LLMs, increasingly participate in mathematics with people -- whether for advanced research or education -- it becomes important to understand how well their judgments align with human ones. Our work examines this alignment through two empirical studies of human and LLM assessment of mathematical interestingness and difficulty, spanning a range of mathematical experience. We study two groups: participants from a crowdsourcing platform and International Math Olympiad competitors. We show that while many LLMs appear to broadly agree with human notions of interestingness, they mostly do not capture the distribution observed in human judgments. Moreover, most LLMs only somewhat align with why humans find certain math problems interesting, showing weak correlation with human-selected interestingness rationales. Together, our findings highlight both the promises and limitations of current LLMs in capturing human interestingness judgments for mathematical AI thought partnerships.