Understanding LLM Scientific Reasoning through Promptings and Model's Explanation on the Answers

📅 2025-05-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically examines the capability boundaries and interpretability challenges of large language models (LLMs) in graduate-level scientific reasoning, using the GPQA benchmark. Methodologically, it evaluates seven prompting techniques on GPT-4o, including zero-shot direct answering, chain-of-thought (CoT), and self-consistency. Results show that self-consistency achieves the highest accuracy (52.99%) but ranks second worst in explanation quality, suggesting that LLMs rely on pattern matching rather than genuine logical deduction. To address this trade-off between accuracy and interpretability, the work proposes a research agenda combining structured reasoning frameworks with hybrid AI and human-in-the-loop approaches. Qualitative attribution analysis further identifies direct answering, CoT, and zero-shot CoT as yielding the most coherent and faithful explanations. Collectively, this research provides both methodological foundations and empirical evidence for enhancing LLM reliability in high-stakes domains such as science, medicine, and law.

📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and problem-solving across various domains. However, their ability to perform complex, multi-step reasoning tasks, essential for applications in science, medicine, and law, remains an area of active investigation. This paper examines the reasoning capabilities of contemporary LLMs, analyzing their strengths, limitations, and potential for improvement. The study uses prompt engineering techniques on the Graduate-Level Google-Proof Q&A (GPQA) dataset to assess the scientific reasoning of GPT-4o. Five popular prompt engineering techniques and two tailored promptings were tested: baseline direct answer (zero-shot), chain-of-thought (CoT), zero-shot CoT, self-ask, self-consistency, decomposition, and multipath prompting. Our findings indicate that while LLMs exhibit emergent reasoning abilities, they often rely on pattern recognition rather than true logical inference, leading to inconsistencies in complex problem-solving. The results indicated that self-consistency outperformed the other prompt engineering techniques with an accuracy of 52.99%, followed by direct answer (52.23%). Zero-shot CoT (50%) outperformed multipath (48.44%), decomposition (47.77%), self-ask (46.88%), and CoT (43.75%). Self-consistency performed second worst at explaining its answers. Simple techniques such as direct answer, CoT, and zero-shot CoT produced the best scientific reasoning. We propose a research agenda aimed at bridging these gaps by integrating structured reasoning frameworks, hybrid AI approaches, and human-in-the-loop methodologies. By critically evaluating the reasoning mechanisms of LLMs, this paper contributes to the ongoing discourse on the future of artificial general intelligence and the development of more robust, trustworthy AI systems.
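The top-scoring technique in the abstract, self-consistency, can be sketched in a few lines: sample several chain-of-thought completions at nonzero temperature and majority-vote the final answers. The sampler below is a hypothetical toy stand-in, not the paper's actual GPT-4o setup; `self_consistency` and `toy_sampler` are illustrative names, not the authors' code.

```python
import random
from collections import Counter

def self_consistency(question: str, sample_fn, n_samples: int = 7):
    """Sample several completions for one question and majority-vote
    the final answers; return the winner and its agreement rate."""
    answers = [sample_fn(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Toy sampler standing in for a temperature-sampled LLM call:
# noisy multiple-choice answers biased toward option "A".
random.seed(0)
def toy_sampler(question: str) -> str:
    return random.choice(["A", "A", "A", "B", "C"])

answer, agreement = self_consistency("Which option ...?", toy_sampler, n_samples=25)
print(answer, agreement)
```

In the paper's setting, each sample would be a full chain-of-thought completion from GPT-4o with the final multiple-choice letter extracted before voting; the vote raises accuracy but, as the results note, the aggregated answer comes with no single coherent explanation.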
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' scientific reasoning using prompt engineering techniques
Evaluating strengths and limitations of LLMs in complex problem-solving
Proposing improvements for LLM reasoning via hybrid AI approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses prompt engineering techniques on GPQA dataset
Tests five popular and two tailored prompting techniques
Proposes structured reasoning and hybrid AI approaches
Alice Rueda
Interventional Psychiatry Program, St. Michael’s Hospital, Unity Health Toronto, Toronto, Ontario, Canada
Mohammed S. Hassan
Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University, Toronto
Argyrios Perivolaris
Interventional Psychiatry Program, St. Michael’s Hospital, Unity Health Toronto, Toronto, Ontario, Canada
Bazen G. Teferra
Interventional Psychiatry Program, St. Michael’s Hospital, Unity Health Toronto, Toronto, Ontario, Canada
Reza Samavi
Associate Professor, Toronto Metropolitan University
Security and Privacy, Machine Learning
Sirisha Rambhatla
Assistant Professor at the University of Waterloo
Machine Learning, Statistical Signal Processing, Optimization, AI for Healthcare
Yuqi Wu
PhD, University of Alberta, Fudan University
Natural Language Processing, Computational Psychiatry, Large Language Models
Yanbo Zhang
Department of Psychiatry, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, Alberta, Canada
Bo Cao
Department of Psychiatry, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, Alberta, Canada
Divya Sharma
Department of Mathematics and Statistics, York University, Ontario, Canada
Sridhar Krishnan
Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University, Toronto
Venkat Bhat
Interventional Psychiatry Program, St. Michael’s Hospital, Unity Health Toronto, Toronto, Ontario, Canada and the Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada