DAG-Math: Graph-Guided Mathematical Reasoning in LLMs

📅 2025-10-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the fundamental nature of mathematical reasoning in large language models (LLMs)—specifically, whether it stems from search, memorization, or rule-consistent logical deduction—moving beyond mere answer correctness. Method: We propose DAG-MATH, a framework that formalizes chain-of-thought (CoT) reasoning as a regularized stochastic process over a directed acyclic graph (DAG), where nodes represent intermediate derivation states and edges denote rule applications; we further introduce a logical closeness metric to quantify the internal consistency of reasoning paths. Using this framework, we construct a novel benchmark enabling fine-grained evaluation of reasoning trajectories across multiple mathematical reasoning datasets. Contribution/Results: Experiments reveal substantial variation in logical consistency among models achieving identical PASS@k accuracy, demonstrating that the framework effectively uncovers disparities in reasoning quality and thereby offers a more nuanced, structure-aware assessment of LLM mathematical reasoning beyond surface-level correctness.

📝 Abstract
Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT), yet it remains unclear whether this success stems from search, rote procedures, or rule-consistent reasoning. To address this, we propose modeling CoT as a rule-based stochastic process over directed acyclic graphs (DAGs), where nodes represent intermediate derivation states and edges encode rule applications. Within this framework, we introduce logical closeness, a metric that quantifies how well a model's CoT trajectory (i.e., the LLM's final output) adheres to the DAG structure, providing evaluation beyond classical PASS@k metrics. Building on this, we introduce the DAG-MATH CoT format and construct a benchmark that guides LLMs to generate CoT trajectories in this format, thereby enabling the evaluation of their reasoning ability under our framework. Across standard mathematical reasoning datasets, our analysis uncovers statistically significant differences in reasoning fidelity among representative LLM families, even when PASS@k is comparable, highlighting gaps between final-answer accuracy and rule-consistent derivation. Our framework strikes a balance between free-form CoT and formal proof systems, offering actionable diagnostics for LLM reasoning evaluation. Our benchmark and code are available at: https://github.com/YuanheZ/DAG-MATH-Formatted-CoT.
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning fidelity in LLMs beyond answer accuracy
Modeling Chain-of-Thought as rule-based DAG stochastic processes
Quantifying logical adherence between CoT trajectories and DAG structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Models CoT as rule-based stochastic DAG process
Introduces logical closeness metric for reasoning evaluation
Proposes DAG-MATH benchmark for structured reasoning assessment
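To make the DAG modeling concrete, the sketch below shows one way a CoT trajectory could be represented as a DAG: each node is an intermediate derivation state, and each edge records which earlier states a step's rule application depends on. The `Step` class, the `trajectory_consistency` function, and the scoring rule are illustrative assumptions for exposition only, not the paper's actual DAG-MATH format or logical closeness metric.

```python
# Illustrative sketch (NOT the paper's implementation): a CoT trajectory as a
# DAG where nodes are derivation states and edges point to the premises each
# step uses. A toy consistency score checks that every step's premises refer
# to strictly earlier steps, i.e., the edges respect the DAG's partial order.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One intermediate derivation state in a CoT trajectory."""
    statement: str
    premises: List[int] = field(default_factory=list)  # indices of earlier steps


def trajectory_consistency(steps: List[Step]) -> float:
    """Toy proxy for logical closeness: the fraction of steps whose premise
    edges all point to strictly earlier states (acyclic, well-founded use)."""
    if not steps:
        return 1.0
    valid = sum(
        1
        for i, step in enumerate(steps)
        if all(0 <= p < i for p in step.premises)
    )
    return valid / len(steps)


# A three-step derivation for "x + 2 = 5, find x":
trajectory = [
    Step("x + 2 = 5"),                # given
    Step("x = 5 - 2", premises=[0]),  # subtract 2 from both sides
    Step("x = 3", premises=[1]),      # evaluate
]
print(trajectory_consistency(trajectory))  # -> 1.0
```

Under this toy scoring, a trajectory whose steps cite only earlier states scores 1.0, while a step citing itself or a later state lowers the score; the paper's actual metric additionally checks rule-level fidelity, which this sketch does not attempt.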