FOL-Pretrain: A complexity-annotated corpus of first-order logic

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing studies of large language models (LLMs) offer little mechanistic insight into complex algorithmic reasoning, particularly first-order logic (FOL) inference, because they rely on simplistic tasks, uncontrolled pretraining data, and limited interpretability. Method: The authors introduce the first large-scale, fully open-source corpus of FOL reasoning traces (3.5B tokens), combining 8.8M LLM-augmented, human-annotated traces with 7.5M synthetically generated traces rigorously verified by an automated theorem prover. The pipeline comprises an algorithmic-provenance metadata framework for fine-grained complexity quantification and step-level verifiability, formal trajectory serialization, complexity-aware stratified sampling, and an LLM–theorem-prover collaborative annotation loop. Contribution/Results: The dataset improves LLMs' generalization, stepwise consistency, and controllability in complexity-aware symbolic reasoning, enabling rigorous empirical investigation of algorithmic reasoning mechanisms.
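To make the "algorithmic provenance metadata" and "complexity-aware stratified sampling" ideas concrete, here is a minimal sketch. The record fields (`proof_depth`, `rule_counts`, etc.) are illustrative assumptions, not the paper's actual serialization schema; the sampler simply draws an equal number of traces from each proof-depth stratum.

```python
import random
from collections import defaultdict

def make_record(trace_id, premises, conclusion, steps, depth, rule_counts):
    """Hypothetical schema for one annotated FOL reasoning trace.

    Field names are illustrative; the paper's serialization format
    may differ.
    """
    return {
        "id": trace_id,
        "premises": premises,            # FOL formulas as strings
        "conclusion": conclusion,
        "steps": steps,                  # ordered proof steps
        "complexity": {
            "proof_depth": depth,        # e.g. depth of the proof tree
            "rule_counts": rule_counts,  # inference-rule usage per type
        },
    }

def stratified_sample(records, per_stratum, seed=0):
    """Draw up to `per_stratum` records from each proof-depth stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[r["complexity"]["proof_depth"]].append(r)
    sample = []
    for depth in sorted(strata):
        pool = strata[depth]
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample
```

Stratifying on a complexity annotation like proof depth is what makes it possible to train or evaluate on a controlled difficulty distribution rather than whatever mix raw generation happens to produce.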

📝 Abstract
Transformer-based large language models (LLMs) have demonstrated remarkable reasoning capabilities, ranging from coding and mathematical problem solving to commonsense inference. While these tasks vary in complexity, they all require models to integrate and compute over structured information. Despite recent efforts to reverse-engineer LLM behavior through controlled experiments, our understanding of how these models internalize and execute complex algorithms remains limited. Progress has largely been confined to small-scale studies or shallow tasks such as basic arithmetic and grammatical pattern matching. One barrier to deeper understanding is the nature of pretraining data -- vast, heterogeneous, and often poorly annotated, making it difficult to isolate mechanisms of reasoning. To bridge this gap, we introduce a large-scale, fully open, complexity-annotated dataset of first-order logic reasoning traces, designed to probe and analyze algorithmic reasoning in LLMs. The dataset consists of 3.5 billion tokens, including 8.8 million LLM-augmented, human-annotated examples and 7.5 million synthetically generated examples. Each synthetic example is verifiably correct, produced by a custom automated theorem solver, and accompanied by metadata tracing its algorithmic provenance. We aim to provide a scalable, interpretable artifact for studying how LLMs learn and generalize symbolic reasoning processes, paving the way for more transparent and targeted investigations into the algorithmic capabilities of modern models.

Problem

Research questions and friction points this paper is trying to address.

Understanding how LLMs internalize complex algorithms
Lack of annotated data for studying reasoning mechanisms
Need scalable tools to analyze symbolic reasoning in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale complexity-annotated first-order logic dataset
Custom automated theorem solver for synthetic examples
LLM-augmented human-annotated reasoning traces