🤖 AI Summary
Large language models (LLMs) are unreliable at detecting mathematical reasoning errors, and existing prompting methods generalize poorly. Method: We propose Pedagogical Chain-of-Thought (PedCoT), a zero-shot error-detection framework that systematically integrates the Bloom Cognitive Model into prompt engineering. PedCoT combines an educationally grounded prompt structure, a two-stage interaction process, and grounded reasoning guidance (requiring neither fine-tuning nor exemplars) to elicit fine-grained identification of logical fallacies, computational deviations, and other errors within reasoning chains. Contribution/Results: Evaluated on mathematical benchmarks of varying difficulty, including MATH and GSM-Hard, PedCoT significantly outperforms strong baselines such as Chain-of-Thought and Self-Consistency, with absolute improvements of 12.6–23.4% in error-identification accuracy. This establishes a robust foundation for automated mathematical assessment and self-correction.
📝 Abstract
Self-correction is emerging as a promising approach to mitigate the issue of hallucination in Large Language Models (LLMs). To facilitate effective self-correction, recent research has proposed mistake detection as its initial step. However, current literature suggests that LLMs often struggle to reliably identify reasoning mistakes when using simplistic prompting strategies. To address this challenge, we introduce a novel prompting strategy, termed the Pedagogical Chain-of-Thought (PedCoT), which is specifically designed to guide the identification of reasoning mistakes, particularly mathematical reasoning mistakes. PedCoT consists of pedagogical principles for prompt (PPP) design, a two-stage interaction process (TIP), and grounded PedCoT prompts, all inspired by the educational theory of the Bloom Cognitive Model (BCM). We evaluate our approach on two public datasets featuring math problems of varying difficulty levels. The experiments demonstrate that our zero-shot prompting strategy significantly outperforms strong baselines. The proposed method achieves reliable mathematical mistake identification and provides a foundation for automatic math answer grading. The results underscore the significance of educational theory, serving as domain knowledge, in guiding prompting strategy design for addressing challenging tasks with LLMs effectively.
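To make the two-stage interaction process concrete, the sketch below shows one plausible shape for such a pipeline: a first prompt elicits a step-by-step pedagogical analysis of the solution, and a second prompt feeds that analysis back to the model to obtain a verdict on the first mistaken step. The prompt wording, the `call_llm` stub, and the `detect_mistake` helper are all illustrative assumptions, not the paper's actual PedCoT prompts, which are grounded in the Bloom Cognitive Model and differ in detail.

```python
# Hypothetical sketch of a two-stage interactive prompting pipeline for
# mathematical mistake identification. Template wording is illustrative.

STAGE1_TEMPLATE = (
    "You are a math teacher reviewing a student's solution.\n"
    "Problem: {problem}\n"
    "Solution steps:\n{steps}\n"
    "For each step, restate the knowledge and method it relies on."
)

STAGE2_TEMPLATE = (
    "Using the step-by-step analysis below, evaluate each step and report "
    "the number of the first incorrect step, or 'no mistake' if the "
    "solution is correct.\n"
    "Analysis:\n{analysis}"
)

def call_llm(prompt: str) -> str:
    """Stub standing in for a real chat-completion API call."""
    return f"[model response to {len(prompt)} chars of prompt]"

def detect_mistake(problem: str, steps: list[str]) -> str:
    """Two-stage interaction: first elicit an analysis, then a verdict."""
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    stage1 = STAGE1_TEMPLATE.format(problem=problem, steps=numbered)
    analysis = call_llm(stage1)   # stage 1: comprehension/analysis pass
    stage2 = STAGE2_TEMPLATE.format(analysis=analysis)
    return call_llm(stage2)       # stage 2: evaluation/verdict pass
```

Splitting the task this way mirrors the zero-shot setting: no exemplars are supplied, and the only scaffolding is the pedagogically structured prompts themselves.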