🤖 AI Summary
Current AI systems exhibit limited critical thinking, typically resorting to passive refusal rather than proactively clarifying ambiguous, incomplete, or misleading inputs. Method: This paper introduces the "proactive critical thinking" paradigm, in which models autonomously ask clarifying questions when they encounter insufficient or deceptive information, thereby refining their reasoning and enhancing human-AI collaborative problem-solving. To evaluate this capability, we introduce two benchmarks built on mathematical reasoning tasks, GSM-MC and GSM-MCE, the first systematic evaluations of AI's proactive clarification ability. We further propose an end-to-end reinforcement learning framework and apply it to Qwen3 and Llama-series models. Results: Experiments show that Qwen3-1.7B's accuracy on GSM-MC improves dramatically, from 0.15% to 73.98%, validating both the efficacy and scalability of the paradigm. This work establishes a foundation for advancing AI from passive response toward proactive, cognitively collaborative intelligence.
📝 Abstract
Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm in which models actively seek missing or clarifying information from users to better resolve their queries. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. Experiments on Qwen3 and Llama series models show that, while these models excel at traditional reasoning tasks thanks to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, smaller models especially. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting Qwen3-1.7B's accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.
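To make the GSM-MC construction concrete, here is a minimal sketch of the kind of transformation the abstract describes: deleting one key numeric value from a GSM8K-style word problem so the question becomes unanswerable without a clarifying request. The example problem and the `make_mc_variant` helper are hypothetical illustrations; the paper's actual benchmark-construction pipeline is not specified here.

```python
import re

def make_mc_variant(problem: str, value: str) -> str:
    """Remove one key numeric value from a word problem, replacing it
    with the vague word 'some' so the problem can no longer be solved
    without asking the user for the missing number.
    (Hypothetical illustration of the GSM-MC idea, not the paper's code.)
    """
    # \b anchors ensure we replace the standalone number, not a digit
    # embedded in a larger token; count=1 removes only the first match.
    return re.sub(rf"\b{re.escape(value)}\b", "some", problem, count=1)

original = "Tom has 5 apples and buys 3 more. How many apples does he have?"
mc = make_mc_variant(original, "5")
print(mc)
# → Tom has some apples and buys 3 more. How many apples does he have?
```

A model exhibiting proactive critical thinking would respond to the transformed problem by asking how many apples Tom started with, rather than guessing or refusing.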