🤖 AI Summary
This survey addresses the challenges of applying large language models (LLMs) to mathematical reasoning, formal theorem proving, and educational applications. We propose the first unified framework categorizing five core technical paradigms: instruction fine-tuning, tool augmentation (e.g., Python interpreters and interactive theorem provers), foundational and advanced chain-of-thought (CoT) prompting, multimodal modeling, and hybrid approaches. To enable systematic evaluation, we construct a taxonomy covering more than 60 benchmark datasets and quantitatively analyze over 100 works across performance, methodology, and architectural characteristics. Our analysis identifies fundamental bottlenecks (scalability limitations, misalignment between natural-language reasoning and formal logic, and poor out-of-distribution generalization) and proposes concrete, actionable research directions. All curated resources, including annotated dataset inventories, evaluation metrics, and implementation references, are publicly released. This work delivers both a comprehensive technical roadmap and empirically grounded benchmarks for advancing mathematically capable foundation models.
📝 Abstract
In recent years, remarkable progress has been made in leveraging Language Models (LMs), encompassing both Pre-trained Language Models (PLMs) and Large-scale Language Models (LLMs), within the domain of mathematics. This paper presents a comprehensive survey of mathematical LMs, systematically categorizing pivotal research from two distinct perspectives: tasks and methodologies. The landscape reveals a large number of proposed mathematical LLMs, which we further delineate into instruction learning, tool-based methods, fundamental CoT techniques, advanced CoT methodologies, and multimodal methods. To better understand the benefits of mathematical LMs, we conduct an in-depth comparison of their characteristics and performance. In addition, our survey compiles over 60 mathematical datasets, spanning training, benchmark, and augmented datasets. By addressing the primary challenges and delineating future trajectories in the field of mathematical LMs, this survey aims to facilitate and inspire future innovation among researchers invested in advancing this domain.