AI-Assisted Generation of Difficult Math Questions

📅 2024-07-30
🏛️ arXiv.org
📈 Citations: 6 (influential citations: 0)
🤖 AI Summary
Expert authoring of new math problems is costly, and AI-generated questions often fail to be both difficult and diverse. Method: the paper proposes a human-in-the-loop framework that (1) leverages the metacognitive capabilities of a strong LLM to automatically extract fine-grained mathematical skills from the MATH dataset; (2) prompts the LLM with randomly composed pairs of skills, a dual-skill mechanism designed to trigger out-of-distribution (OOD), high-difficulty problem generation; and (3) iteratively refines problems via multi-round LLM generation and human verification to construct the high-quality benchmark MATH². Contribution/Results: the paper introduces the first "dual-skill composition → OOD hard problem" generation paradigm. Empirically, a model's success rate on the novel problems approximates the square of its accuracy on the original MATH, revealing a fundamental bottleneck in compositional skill reasoning. Experiments show MATH² substantially reduces the performance of state-of-the-art models, and using MATH² examples as in-context demonstrations improves model accuracy on the original MATH benchmark.
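
The dual-skill composition step is simple to sketch. Below is a minimal illustration in Python of the prompting loop described above, assuming a generic text-in/text-out llm callable and a hypothetical skill list; it is not the authors' pipeline code, just the shape of the loop.

    import random

    # Hypothetical skill names; the paper extracts these automatically
    # from the MATH dataset via metacognitive prompting of a strong LLM.
    SKILLS = ["modular arithmetic", "geometric series",
              "inclusion-exclusion", "polynomial roots"]

    def compose_prompt(skill_a, skill_b):
        # One question must genuinely require BOTH skills; pairing two
        # unrelated skills is what pushes generation out of distribution.
        return (f"Write a challenging math question whose solution requires "
                f"both '{skill_a}' and '{skill_b}', then give a fully "
                f"worked solution.")

    def generate_candidate(llm):
        skill_a, skill_b = random.sample(SKILLS, 2)  # random distinct pair
        draft = llm(compose_prompt(skill_a, skill_b))
        # Multi-turn refinement: the model critiques and revises its own
        # draft before the question goes to human verification.
        critique = llm("Critique this question and solution for correctness "
                       "and for genuine use of both skills:\n" + draft)
        return llm("Revise the question and solution to address this "
                   "critique:\n" + critique + "\n\nOriginal:\n" + draft)

Here llm stands in for any chat-completion call; the real pipeline runs several rounds of this generate-and-refine loop and then hands candidates to human annotators.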

📝 Abstract
Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage the metacognition skills [Didolkar et al., 2024] of a strong LLM to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. The use of two different skills within each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multiturn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline on skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH$^2$, a dataset of higher-quality math questions, as evidenced by: (a) lower performance of all models on MATH$^2$ than on MATH, and (b) higher performance on MATH when using MATH$^2$ questions as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight. Also of interest is a striking relationship between models' performance on the two datasets: the success rate on MATH$^2$ is roughly the square of the success rate on MATH, suggesting that successfully solving a question in MATH$^2$ requires a nontrivial combination of two distinct math skills.
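
A back-of-the-envelope reading of that squared relationship (an independence heuristic, not a formal claim made in the abstract): if a model executes each required skill correctly with probability roughly equal to its overall MATH accuracy $p$, and the two skills in a MATH$^2$ question succeed roughly independently, then

$$\Pr[\text{solve a MATH}^2 \text{ question}] \approx p \cdot p = p^2.$$

For example, a model scoring $0.7$ on MATH would be predicted to score about $0.49$ on MATH$^2$.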
Problem

Research questions and friction points this paper is trying to address.

Math Problem Generation
AI Model Training
Educational Assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language Models
Skill Extraction
Mathematical Problem Generation
👥 Authors
Vedant Shah
Mila – Quebec AI Institute, Université de Montréal
Dingli Yu
OpenAI
Kaifeng Lyu
Tsinghua University
Simon Park
Princeton University
Large Language Models
Nan Rosemary Ke
Google DeepMind, Mila
Deep Learning · Causal Modeling · Sequence Modeling · Machine Learning
M. Mozer
University of Colorado, Boulder
Y. Bengio
Mila – Quebec AI Institute, Université de Montréal
Sanjeev Arora
Princeton University
Anirudh Goyal
Mila, Université de Montréal
Machine Learning · Deep Learning · Deep Reinforcement Learning