🤖 AI Summary
Existing benchmarks for logical reasoning evaluate only final-answer accuracy, neglecting the quality of the reasoning process. This paper proposes FineLogic, a fine-grained evaluation framework that systematically assesses large language models' reasoning along three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. The authors construct multi-style supervised fine-tuning data in natural language, Logic, Chain-of-Symbol, and Tree-of-Symbol formats, and combine human annotation, automated verification, and representation-layer probing. The study finds that natural language supervision enhances out-of-distribution (OOD) generalization and long-context robustness, whereas symbolic supervision markedly improves step-level logical correctness and structural atomicity. Crucially, fine-tuning primarily refines the stepwise generation mechanism rather than inducing abrupt representational shifts. Together, these results offer a disentangled evaluation paradigm and a path toward controllable training of reasoning capabilities.
📝 Abstract
Logical reasoning is a core capability for many applications of large language models (LLMs), yet existing benchmarks often rely solely on final-answer accuracy, failing to capture the quality and structure of the reasoning process. We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. In addition, to better understand how reasoning capabilities emerge, we conduct a comprehensive study of the effects of supervision format during fine-tuning. We construct four supervision styles (one natural language and three symbolic variants) and train LLMs under each. Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks, while symbolic reasoning styles promote more structurally sound and atomic inference chains. Furthermore, our representation-level probing shows that fine-tuning primarily improves reasoning through step-by-step generation, rather than enhancing shortcut prediction or internalized correctness. Together, our framework and analysis provide a more rigorous and interpretable lens for evaluating and improving logical reasoning in LLMs.