🤖 AI Summary
Existing physical reasoning benchmarks predominantly evaluate only final answers, neglecting the reasoning process, while process-level evaluation methods rely on heuristic LLM scoring or linear assumptions, limiting their reliability and interpretability.
Method: We propose a causal directed acyclic graph (DAG)-based framework for evaluating physical reasoning processes: it explicitly models structured dependencies among formulas via a DAG, proves its optimality in process representation, and integrates symbolic formula equivalence verification with theoretically grounded scoring to enable fine-grained, interpretable, and heuristic-free assessment. A rule-driven matching engine ensures cross-path consistency.
Contribution/Results: Experiments show strong agreement between our framework and human experts (Cohen’s κ = 0.92), effectively exposing structural deficiencies in mainstream large language models during physical reasoning. The framework delivers precise, diagnostic signals for targeted model refinement and training.
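The core idea of DAG-based process scoring can be sketched in a few lines: each formula step is a node whose causal prerequisites must themselves be credited before the step earns credit, so structural errors propagate downstream. This is an illustrative sketch only; the example steps, dependency graph, and credit rule are assumptions for exposition, not the paper's exact scoring policy.

```python
# Hypothetical sketch of DAG-based process scoring.
# The formulas, dependencies, and credit rule are illustrative
# assumptions, not PRISM-Physics's actual policy.
from graphlib import TopologicalSorter

# Each formula step maps to the prerequisite steps it causally depends on.
dag = {
    "F=m*a": [],             # Newton's second law (no prerequisites)
    "a=F/m": ["F=m*a"],      # acceleration rearranged from the law above
    "v=a*t": ["a=F/m"],      # kinematics using the derived acceleration
}

def score(dag, correct_steps):
    """Credit a step only if it is correct AND every causal
    prerequisite was also credited, so an early structural error
    invalidates all steps that depend on it."""
    credited = set()
    # TopologicalSorter visits prerequisites before dependents.
    for step in TopologicalSorter(dag).static_order():
        if step in correct_steps and all(p in credited for p in dag[step]):
            credited.add(step)
    return len(credited) / len(dag)
```

For example, a solution with a correct final formula but a wrong intermediate rearrangement receives credit only for the steps whose causal chain is intact: `score(dag, {"F=m*a", "v=a*t"})` credits just `F=m*a`, since `v=a*t` depends on the uncredited `a=F/m`.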
📝 Abstract
Benchmarks for competition-style reasoning have advanced evaluation in mathematics and programming, yet physics remains comparatively underexplored. Most existing physics benchmarks evaluate only final answers, failing to capture the reasoning process, while recent stepwise methods rely on heuristic LLM-as-judge scoring or restrictive linear assumptions, limiting reliability and diagnostic validity. We introduce PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and of the corresponding scoring policy. Combined with a fully rule-based method for symbolic formula equivalence matching that we developed, our framework ensures consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework aligns closely with human experts' scoring. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for subsequent training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.
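The symbolic formula equivalence matching described above can be approximated with standard computer algebra: two formulas match if their difference simplifies to zero, regardless of how each is algebraically written. The sketch below uses sympy's simplifier purely as a stand-in; the paper's engine is fully rule-driven, and the specific function name and formulas are illustrative assumptions.

```python
# Illustrative stand-in for symbolic formula equivalence matching.
# PRISM-Physics uses a fully rule-based engine; sympy's general
# simplifier is substituted here only to demonstrate the idea.
import sympy as sp

def equivalent(expr_a: str, expr_b: str) -> bool:
    """Return True if two formula strings denote the same expression,
    e.g. '(1/2)*m*v**2' and 'm*v**2/2'."""
    a, b = sp.sympify(expr_a), sp.sympify(expr_b)
    # Equivalent formulations differ by zero after simplification.
    return sp.simplify(a - b) == 0
```

A check like `equivalent("(1/2)*m*v**2", "m*v**2/2")` accepts an algebraically rearranged kinetic-energy formula, while `equivalent("m*g*h", "2*m*g*h")` rejects a genuinely different one, which is the cross-path consistency the matching engine must provide.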