🤖 AI Summary
Existing benchmarks for logical reasoning evaluate only final-answer accuracy, neglecting the quality of the reasoning process. This paper proposes FineLogic, a fine-grained evaluation framework that systematically assesses large language models' reasoning along three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. The authors construct multi-style supervised fine-tuning data in natural language, Logic, Chain-of-Symbol, and Tree-of-Symbol formats, and combine human annotation, automated verification, and representation-layer probing. The study finds that natural language supervision enhances out-of-distribution (OOD) generalization and long-context robustness, whereas symbolic supervision markedly improves step-level logical correctness and structural atomicity. Crucially, fine-tuning primarily refines the stepwise generation mechanism rather than inducing abrupt representational shifts. Together, these results offer a disentangled evaluation paradigm and a path toward controllable training of reasoning capabilities.
📝 Abstract
Logical reasoning is a core capability for many applications of large language models (LLMs), yet existing benchmarks often rely solely on final-answer accuracy, failing to capture the quality and structure of the reasoning process. We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. In addition, to better understand how reasoning capabilities emerge, we conduct a comprehensive study of the effects of supervision format during fine-tuning. We construct four supervision styles (one natural language and three symbolic variants) and train LLMs under each. Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks, while symbolic reasoning styles promote more structurally sound and atomic inference chains. Furthermore, our representation-level probing shows that fine-tuning primarily improves reasoning through step-by-step generation, rather than enhancing shortcut prediction or internalized correctness. Together, our framework and analysis provide a more rigorous and interpretable lens for evaluating and improving logical reasoning in LLMs.