🤖 AI Summary
This work addresses the failure of large language models (LLMs) to reason about novel high-school mathematics problems, particularly out-of-distribution (OOD) questions. We propose DeduCE, a framework that formalizes deductive consistency as two decoupled subtasks, *premise understanding* and *multi-hop derivation*, and introduce a synthetic perturbation-based evaluation pipeline that is robust to benchmark memorization. Through fine-grained chain-of-thought analysis and controlled experiments, we find that premise understanding remains robust, whereas accuracy drops by over 40% beyond three reasoning steps, a degradation masked by conventional final-answer accuracy metrics. Our study suggests that multi-hop derivation is a general bottleneck rather than a task- or dataset-specific limitation. By shifting evaluation from end-state correctness to intermediate reasoning fidelity, DeduCE offers a diagnostic lens for identifying and improving deep deductive reasoning in LLMs.
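To make the metric concrete, here is a minimal sketch (not the paper's released code) of how per-hop accuracy over chain-of-thought intermediate values could be computed. The `hop_accuracy` helper and the representation of each solution as a list of intermediate values, one per hop, are illustrative assumptions:

```python
from typing import Dict, List


def hop_accuracy(
    gold_chains: List[List[float]],
    model_chains: List[List[float]],
    tol: float = 1e-6,
) -> Dict[int, float]:
    """Per-hop accuracy: among problems whose gold solution reaches
    hop k, the fraction where the model's k-th intermediate value
    matches the gold value (within a numeric tolerance)."""
    correct: Dict[int, int] = {}
    total: Dict[int, int] = {}
    for gold, pred in zip(gold_chains, model_chains):
        for k, g in enumerate(gold, start=1):
            total[k] = total.get(k, 0) + 1
            # The model chain may be shorter than the gold chain if
            # the model stopped early; a missing hop counts as wrong.
            if k <= len(pred) and abs(pred[k - 1] - g) <= tol:
                correct[k] = correct.get(k, 0) + 1
    return {k: correct.get(k, 0) / total[k] for k in sorted(total)}


if __name__ == "__main__":
    gold = [[4.0, 12.0, 7.0], [3.0, 9.0]]
    pred = [[4.0, 12.0, 9.0], [3.0, 8.0]]
    print(hop_accuracy(gold, pred))  # {1: 1.0, 2: 0.5, 3: 0.0}
```

Plotting this dictionary against hop depth is one way to surface the accuracy decay that a single final-answer score would hide.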
📝 Abstract
Despite strong performance on Olympiad-level reasoning problems, frontier large language models can still struggle on high school math when presented with novel problems outside standard benchmarks. Going beyond final accuracy, we propose a deductive consistency metric to analyze chain-of-thought output from language models (LMs). Formally, deductive reasoning involves two subtasks: understanding a set of input premises and inferring the conclusions that follow from them. The proposed metric studies LMs' performance on these subtasks, with the goal of explaining LMs' reasoning errors on novel problems: how well do LMs understand input premises as context length grows, and how well can they infer conclusions over multiple reasoning hops? Since existing benchmarks may be memorized, we develop a pipeline to evaluate LMs' deductive consistency on novel, perturbed versions of benchmark problems. On novel grade school math problems (GSM-8k), we find that LMs are fairly robust to an increasing number of input premises but suffer significant accuracy decay as the number of reasoning hops increases. Interestingly, these errors are masked in the original benchmark, where all models achieve near 100% accuracy. As we increase the number of solution steps using a synthetic dataset, prediction over multiple hops remains the major source of error compared to understanding input premises. Other factors, such as shifts in language style or the natural propagation of early errors, do not explain these trends. Our analysis provides a new view for characterizing LM reasoning -- as computation over a window of input premises and reasoning hops -- that can support unified evaluation across problem domains.
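As a rough illustration of the perturbation idea, the sketch below pairs a templated word problem with an executable gold solution, so freshly sampled numeric premises yield novel problem variants whose intermediate value at every hop is verifiable. The template and helper names (`TEMPLATE`, `solve`, `perturb`) are hypothetical, not taken from the paper:

```python
import random
from typing import List, Tuple

# Hypothetical template: the premises are parameterized numbers and
# the solution is an executable function, so every perturbed variant
# comes with a checkable gold reasoning chain.
TEMPLATE = (
    "Ali has {a} apples. He buys {b} more, then gives half "
    "of his apples to Bo. How many apples does Ali have left?"
)


def solve(a: int, b: int) -> List[int]:
    """Gold reasoning chain, one value per hop."""
    total = a + b      # hop 1: apples after buying
    left = total // 2  # hop 2: apples after giving half away
    return [total, left]


def perturb(rng: random.Random) -> Tuple[str, List[int]]:
    """Sample fresh even premises so the variant is novel but still
    has an exact integer answer at every hop."""
    a = rng.randrange(2, 50, 2)
    b = rng.randrange(2, 50, 2)
    return TEMPLATE.format(a=a, b=b), solve(a, b)


if __name__ == "__main__":
    rng = random.Random(0)
    question, gold_chain = perturb(rng)
    print(question)
    print("gold chain:", gold_chain)
```

Because the gold chain is computed rather than memorized from a benchmark, a model's intermediate predictions can be scored hop by hop on problems it has provably never seen verbatim.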