Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?

πŸ“… 2025-05-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study systematically evaluates the complex logical reasoning capabilities of large language models (LLMs) under formal language guidance, analyzing three dimensions: model architecture, task type, and reasoning trajectory format. Method: We propose a unified, controllable evaluation framework integrating formal language modeling, trajectory-format analysis (e.g., Program-of-Thoughts [PoT], Chain-of-Thought [CoT]), and lightweight fine-tuning via rejection sampling. Contribution/Results: We find that reasoning-augmented β€œthinking” models significantly outperform instruction-tuned models; all evaluated models exhibit a shared bottleneck in inductive reasoning; PoT demonstrates superior generalization across formal languages; and lightweight fine-tuning enables small-scale models to achieve state-of-the-art cross-formal-language reasoning performance. Experiments across multiple logic reasoning benchmarks confirm that formal language grounding enhances both reasoning reliability and interpretability. The codebase and comprehensive evaluation reports are publicly released.
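As a concrete illustration of the trajectory formats compared above, the sketch below shows what a Program-of-Thoughts (PoT) trajectory looks like: the model emits executable code rather than free-text reasoning, so the answer can be verified by running it. The premises and rule here are invented for illustration and are not taken from the paper's benchmarks.

```python
# Illustrative PoT trajectory: the reasoning step is code whose execution
# yields the answer, making the trajectory checkable by an interpreter.

def solve():
    # Invented premises: sparrows and penguins are birds; only sparrows fly.
    birds = {"sparrow", "penguin"}
    can_fly = {"sparrow"}
    # Candidate inductive rule: "every bird flies."
    rule_holds = all(b in can_fly for b in birds)
    return rule_holds

print(solve())  # -> False: the rule is falsified by the penguin counterexample
```

Because the answer is produced by execution rather than by parsing free text, PoT trajectories transfer more cleanly across formal languages, which is consistent with the generalization finding reported above.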

πŸ“ Abstract
Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data in PoT format achieves the best generalization performance across other formal languages. Additionally, we curate formal-language training data to further enhance small language models, and the experimental results indicate that a simple rejection fine-tuning method better enables LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at https://github.com/jiangjin1999/FormalEval.
Problem

Research questions and friction points this paper is trying to address.

Systematically evaluate LLMs' performance on complex logical reasoning guided by formal languages
Assess the limitations of LLMs in inductive reasoning, with and without formal language
Enhance small language models with curated formal-language training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive evaluation of LLMs using formal languages across three dimensions: model spectrum, task taxonomy, and trajectory format
Finding that Thinking models outperform Instruct models, especially when formal language is employed
Rejection fine-tuning that enhances generalization across formal languages
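The rejection fine-tuning idea above can be sketched as a simple filter: sample several candidate trajectories per problem, keep only those whose executed answer matches the gold label, and fine-tune on the survivors. This is a minimal sketch under assumptions; `generate` and `execute` are hypothetical stand-ins for the model's sampler and a formal-language interpreter, not APIs from the paper's codebase.

```python
# Minimal rejection-sampling data filter for fine-tuning (illustrative only).

def generate(model, problem, n=8):
    # Placeholder: sample n candidate reasoning trajectories from the model.
    return [f"trajectory_{i}" for i in range(n)]

def execute(trajectory):
    # Placeholder: run a PoT/formal-language trajectory through an
    # interpreter and return its answer; fixed to True here so the
    # filtering logic is runnable without a real model.
    return True

def build_rft_dataset(model, problems):
    """Keep only trajectories whose executed answer matches the gold label."""
    kept = []
    for problem, gold in problems:
        for traj in generate(model, problem):
            if execute(traj) == gold:
                kept.append((problem, traj))  # accepted sample for SFT
                break  # one verified trajectory per problem suffices here
    return kept
```

The accepted pairs would then be used for ordinary supervised fine-tuning; problems for which no sampled trajectory verifies are simply dropped, which is what makes the method "rejection" based.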
Jin Jiang
Peking University
Jianing Wang
Meituan Group
Yuchen Yan
Meituan Group, Zhejiang University
Yang Liu
Meituan Group
Jianhua Zhu
Peking University
Mengdi Zhang
Meituan Group
Xunliang Cai
Meituan Group
Liangcai Gao
Peking University
artificial intelligence