🤖 AI Summary
This work investigates whether large language models (LLMs) can acquire causal reasoning solely through symbolic demonstrations of causal axioms, without data fitting or hardcoded priors. Because active interventions in the real world are costly, we propose *axiomatic demonstration training*: the model abstracts general rules from concise, formal axiom examples, enabling axiom-level generalization from linear causal chains to complex graph structures, reverse-order chains, and branching topologies. We show, for the first time, that Transformers can spontaneously learn core causal logic, including d-separation and transitivity. We train a 67M-parameter model from scratch and extend the method to fine-tuning Llama-3.1-8B, using a structured, axiom-based dataset covering noise variants and topological diversity. On the Corr2Cause and CLEAR benchmarks, our approach significantly outperforms baselines, achieves state-of-the-art results on several metrics (surpassing even GPT-4), and enables small models to make high-accuracy causal judgments on long chains, reverse-order chains, and branching graphs.
📝 Abstract
For text-based AI systems to interact in the real world, causal reasoning is an essential skill. Since active interventions are costly, we study to what extent a system can learn causal reasoning from symbolic demonstrations of causal axioms. Specifically, we present an axiomatic training method where the system learns from multiple demonstrations of a causal axiom (or rule), rather than incorporating the axiom as an inductive bias or inferring it from data values. A key question is whether the system would learn to generalize from the axiom demonstrations to more complex scenarios. Our results, based on applying axiomatic training to learn the transitivity axiom and d-separation rule, indicate that such generalization is possible. To avoid data contamination issues, we start with a 67 million parameter transformer model and train it from scratch. On both tasks, we find that a model trained on linear causal chains (along with some noisy variations) can generalize well to complex graphs, including longer causal chains, causal chains with reversed order, and graphs with branching. To handle diverse text inputs, we extend the same method to finetune language models. Finetuning the Llama-3.1 8B model on our axiomatic data leads to significant gains on causal benchmarks such as Corr2Cause and CLEAR, in some cases providing state-of-the-art performance that surpasses GPT-4.
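To make the idea of an axiom demonstration concrete, here is a minimal sketch of what a transitivity training example could look like. The function name, field names, and textual encoding below are illustrative assumptions, not the paper's actual data format; the sketch only shows the general pattern of a linear causal chain paired with a transitivity query.

```python
# Hypothetical sketch of an axiomatic training example for the
# transitivity axiom: a linear causal chain as the premise, and a
# query asking whether the first node causes the last.
# (Field names and text encoding are assumptions for illustration.)

def transitivity_demo(num_nodes, reverse=False):
    """Build one demonstration from a linear chain X0 -> X1 -> ... -> Xn."""
    nodes = [f"X{i}" for i in range(num_nodes)]
    edges = [(nodes[i], nodes[i + 1]) for i in range(num_nodes - 1)]
    if reverse:
        # Noisy variant: state the chain's edges in reversed order,
        # so the model cannot rely on left-to-right edge ordering.
        edges = edges[::-1]
    premise = ". ".join(f"{a} causes {b}" for a, b in edges) + "."
    question = f"Does {nodes[0]} cause {nodes[-1]}?"
    # By transitivity, the first node always causes the last in a chain.
    return {"premise": premise, "question": question, "answer": "Yes"}

demo = transitivity_demo(3)
print(demo["premise"])   # X0 causes X1. X1 causes X2.
print(demo["question"])  # Does X0 cause X2?
```

Training on many such short demonstrations (plus noisy variants like the reversed edge ordering) is the setting in which the paper probes generalization to longer chains and branching graphs.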