Verifier-Backed Hard Problem Generation for Mathematical Reasoning

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language models struggle to automatically generate mathematical problems that are simultaneously valid, novel, and challenging; existing approaches either rely on human intervention or produce invalid samples. This work proposes the VHG framework, which adds an independent verifier to self-play, yielding a three-agent mechanism among a problem proposer, a solver, and a verifier. The proposer's reward depends jointly on validity and difficulty, which is designed to mitigate reward hacking. The framework supports two verifier implementations (symbolic hard verifiers and LLM-based soft verifiers) and applies to tasks ranging from indefinite integration to general mathematical reasoning. Experiments show that VHG outperforms baseline methods across multiple mathematical tasks and generates problems with markedly higher validity and difficulty.
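As a rough illustration of what a hard verifier for indefinite integration might check, the sketch below tests whether a candidate antiderivative differentiates back to the integrand. It is not the paper's verifier: it swaps the symbolic check for a numerical central-difference test at random sample points, and the function name `check_antiderivative`, the sampling interval, and the tolerances are all assumptions made for this example.

```python
import math
import random
from typing import Callable


def check_antiderivative(
    integrand: Callable[[float], float],
    antiderivative: Callable[[float], float],
    lo: float = 0.5,
    hi: float = 2.0,
    samples: int = 20,
    h: float = 1e-5,
    tol: float = 1e-4,
    seed: int = 0,
) -> bool:
    """Hypothetical numeric stand-in for a symbolic validity check.

    Returns True if d/dx antiderivative(x) matches integrand(x) at every
    sampled point in [lo, hi], using a central finite difference.
    """
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(samples):
        x = rng.uniform(lo, hi)
        # Central difference approximation of the derivative at x.
        derivative = (antiderivative(x + h) - antiderivative(x - h)) / (2 * h)
        if not math.isclose(derivative, integrand(x), rel_tol=tol, abs_tol=tol):
            return False
    return True


# Usage: the integral of x*cos(x) is x*sin(x) + cos(x) + C.
good = check_antiderivative(
    lambda x: x * math.cos(x),
    lambda x: x * math.sin(x) + math.cos(x),
)
bad = check_antiderivative(
    lambda x: x * math.cos(x),
    lambda x: x * math.sin(x),  # missing the cos(x) term
)
```

A numerical test like this can only reject candidates with confidence; passing it is evidence of correctness, not a proof, which is one reason a symbolic check is the stronger choice when it is available.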

📝 Abstract

Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems, an essential component for advancing LLM training and enabling autonomous scientific research. Existing problem generation approaches either depend on expensive human expert involvement or adopt naive self-play paradigms, which frequently yield invalid problems due to reward hacking. This work introduces VHG, a verifier-enhanced hard problem generation framework built upon three-party self-play. By integrating an independent verifier into the conventional setter-solver pairing, the design makes the setter's reward depend jointly on problem validity (evaluated by the verifier) and difficulty (assessed by the solver). Two verifier variants are instantiated: a Hard symbolic verifier and a Soft LLM-based verifier, with evaluations conducted on indefinite integral tasks and general mathematical reasoning tasks. Experimental results show that VHG outperforms all baseline methods by a clear margin.
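The abstract states that the setter's reward combines validity and difficulty but does not give the formula. The sketch below shows one plausible shape for such a reward, under two loudly stated assumptions: validity acts as a hard gate (invalid problems earn nothing), and difficulty is measured as the solver's failure rate over several attempts. The name `setter_reward` and both design choices are hypothetical, not taken from the paper.

```python
from typing import Sequence


def setter_reward(is_valid: bool, solver_successes: Sequence[bool]) -> float:
    """Hypothetical joint reward for the problem setter.

    is_valid: the verifier's verdict on the generated problem.
    solver_successes: one flag per solver attempt (True = solved).

    Assumption: invalid problems get zero reward, so the setter cannot
    profit from unsolvable or ill-posed problems that merely look hard.
    Assumption: difficulty is the fraction of solver attempts that failed.
    """
    if not is_valid:
        return 0.0
    if not solver_successes:
        raise ValueError("need at least one solver attempt")
    solve_rate = sum(solver_successes) / len(solver_successes)
    return 1.0 - solve_rate


# Usage: a valid problem solved in 1 of 4 attempts scores 0.75,
# while an invalid problem scores 0.0 no matter how often the solver fails.
reward_hard_valid = setter_reward(True, [True, False, False, False])
reward_invalid = setter_reward(False, [False, False, False, False])
```

Gating on validity is what separates this from a naive setter-solver reward: without the gate, the setter's best strategy is to emit problems no solver can answer, which is the reward hacking the abstract describes.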

Problem

Research questions and friction points this paper is trying to address.

problem generation

mathematical reasoning

large language models

hard problems

validity

Innovation

Methods, ideas, or system contributions that make the work stand out.

verifier-enhanced problem generation

three-party self-play

reward hacking mitigation