🤖 AI Summary
This study investigates the separability of two distinct sub-skills, abstract modeling (i.e., relational representation) and arithmetic computation, in how large language models (LLMs) solve mathematical word problems.
Method: We construct a fine-grained evaluation framework based on GSM8K and SVAMP, integrating causal ablation, abstract representation extraction, and transferability analysis.
Contribution/Results: We empirically demonstrate, for the first time, that LLMs (Llama-3 and Qwen2.5, 1B–32B) achieve >90% accuracy in abstract modeling even without chain-of-thought (CoT), indicating robust, pre-existing, composable, and transferable abstract representations. In contrast, final-answer errors stem predominantly from computational failures. We thus propose an "abstract-then-compute" single-forward-pass mechanism, validated via causal ablation. Results show that CoT primarily enhances computational robustness, with negligible improvement to abstract modeling, revealing that the reasoning bottleneck lies not in modeling but in execution.
📝 Abstract
Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B–32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to common belief, we show that CoT primarily aids computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.
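The core idea of disentangled evaluation can be sketched as follows. This is a minimal illustration, not the paper's actual harness: it assumes the model emits both an arithmetic expression (the formulation) and a final number, and it scores the two separately, crediting the expression against a gold expression and crediting the answer against the model's *own* expression (all function names here are hypothetical).

```python
import ast
import operator

# Supported binary operators for the tiny expression evaluator below.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
}

def safe_eval(expr: str) -> float:
    """Evaluate a simple arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression node: {node!r}")
    return walk(ast.parse(expr, mode="eval"))

def disentangled_score(model_expr: str, model_answer: float, gold_expr: str):
    """Return (formulation_correct, computation_correct)."""
    # Formulation: does the model's expression denote the gold quantity?
    formulation_ok = abs(safe_eval(model_expr) - safe_eval(gold_expr)) < 1e-6
    # Computation: does the stated answer match the model's OWN expression?
    computation_ok = abs(safe_eval(model_expr) - model_answer) < 1e-6
    return formulation_ok, computation_ok

# Example: a correct setup, (18 - 3 - 4) * 2 = 22, but an arithmetic slip to 24.
print(disentangled_score("(18 - 3 - 4) * 2", 24.0, "(18 - 3 - 4) * 2"))
# -> (True, False): the abstraction is right, the computation is wrong.
```

Under this scoring, a final-answer metric would mark the example simply "wrong", whereas the disentangled view attributes the error to the computation step, which is the distinction the paper's findings turn on.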