Causal Reasoning Favors Encoders: On The Limits of Decoder-Only Models

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how architectural choices in large language models (LLMs) affect causal reasoning, specifically on multi-hop, conjunctive causal tasks under in-context learning (ICL). It identifies a critical limitation: decoder-only models are highly susceptible to spurious lexical correlations and exhibit poor robustness to distributional shifts. In contrast, encoder-only (e.g., BERT) and encoder-decoder (e.g., T5) architectures demonstrate markedly stronger cross-distribution robustness and short-range causal reasoning—both in zero-/few-shot ICL and supervised fine-tuning. The study provides the first systematic evidence that superior “latent-space projection capability” underpins this advantage, challenging the assumption that scaling decoder-only models is optimal for causal reasoning. Experiments span both natural-language and symbolic causal tasks, showing that lightweight encoder-based models can efficiently match or surpass ultra-large decoder-only models. These findings establish a new architectural paradigm for causal reasoning in LLMs.

📝 Abstract
In-context learning (ICL) underpins recent advances in large language models (LLMs), although its role and performance in causal reasoning remain unclear. Causal reasoning demands multi-hop composition and strict conjunctive control, and reliance on spurious lexical relations in the input can yield misleading results. We hypothesize that, owing to their ability to project the input into a latent space, encoder and encoder-decoder architectures are better suited to such multi-hop conjunctive reasoning than decoder-only models. To test this hypothesis, we compare fine-tuned versions of all the aforementioned architectures with zero- and few-shot ICL in both natural-language and non-natural-language scenarios. We find that ICL alone is insufficient for reliable causal reasoning, often over-focusing on irrelevant input features. In particular, decoder-only models are noticeably brittle to distributional shifts, while fine-tuned encoder and encoder-decoder models generalize more robustly across our tests, including the non-natural-language split. Both architectures are matched or surpassed by decoder-only architectures only at large scales. We conclude that for cost-effective, robust short-horizon causal reasoning, encoder or encoder-decoder architectures with targeted fine-tuning are preferable.
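To make the task class concrete, the following is a minimal sketch of a symbolic multi-hop, strictly conjunctive causal task in the spirit of the abstract's setup. The rule format, variable names, and `propagate` helper are illustrative assumptions, not the paper's actual benchmark.

```python
# Toy symbolic conjunctive causal task (illustrative; not the paper's benchmark).

def propagate(rules, facts):
    """Forward-chain conjunctive causal rules: a child variable becomes true
    only when ALL of its parent variables are true (strict conjunctive control)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for parents, child in rules:
            if child not in facts and all(p in facts for p in parents):
                facts.add(child)
                changed = True
    return facts

# Two-hop conjunctive chain: C requires A and B; E requires C and D.
rules = [({"A", "B"}, "C"), ({"C", "D"}, "E")]

print("E" in propagate(rules, {"A", "B", "D"}))  # True: both hops fire
print("E" in propagate(rules, {"A", "D"}))       # False: B is missing, so C (and thus E) never fires
```

A model relying on surface lexical cues might answer "E" whenever "A" and "D" appear in the prompt; solving the task reliably requires composing both hops and verifying every conjunct, which is the failure mode the paper probes under distributional shift.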
Problem

Research questions and friction points this paper is trying to address.

Evaluating causal reasoning in decoder-only vs. encoder models
Assessing ICL's limitations for reliable multi-hop conjunctive reasoning
Testing model robustness to distribution shifts in causal tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encoder architectures outperform decoder-only models in causal reasoning
Fine-tuned encoders generalize robustly across distributional shifts
Targeted fine-tuning enables cost-effective short-horizon causal reasoning