🤖 AI Summary
Large reasoning models (LRMs) lack proactive question-asking capabilities when confronting incomplete mathematical problems—a critical gap in current evaluation frameworks. Method: We construct the first multi-scenario dataset of incomplete mathematical problems and propose a novel three-dimensional evaluation paradigm centered on “proactivity,” “over-reasoning,” and “hallucination.” Through qualitative and quantitative analyses, we systematically characterize LRM behaviors, revealing widespread avoidance of questioning, tendencies toward over-reasoning, and hallucinatory answer generation. We further investigate supervised fine-tuning (SFT) for enhancing question-asking ability, finding marginal improvements but no fundamental resolution of the proactivity deficit. Contribution/Results: This work pioneers the integration of proactive information seeking into LRM evaluation, establishing a theoretical foundation, a standardized benchmark, and actionable improvement pathways toward AI agents with authentic interactive intelligence.
📝 Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, as evaluated by existing benchmarks exclusively on well-defined problems. However, such an evaluation setup constitutes a critical gap, since a genuinely intelligent agent should not only solve problems (as a math quiz solver) but also be able to ask for information when a problem is underspecified, enabling proactive responses to users' requests. To bridge this gap, we propose a new dataset consisting of two types of incomplete problems with diverse contexts. Based on the dataset, our systematic evaluation of LRMs reveals their inability to proactively ask for information. In addition, we uncover behaviors related to overthinking and hallucination in LRMs, and highlight the potential and challenges of supervised fine-tuning in learning this ability. We hope to provide new insights into developing LRMs with genuine intelligence, rather than mere problem-solving.