Prediction Bottlenecks Don't Discover Causal Structure (But Here's What They Actually Do)

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks whether models trained only for next-step prediction genuinely recover causal structure among variables. The test protocol is packaged as a standardized, reusable falsification benchmark for causal discovery claims, covering synthetic and real-world datasets, three intervention semantics ($do(X=c)$, soft-noise, and random-forcing), and size-matched control arms. The evaluation covers Mamba state-space models, linear bottleneck architectures, Lasso, PCMCI, and Granger causality. A plain linear bottleneck matches or beats the Mamba model, but it does not beat classical methods: tuned Lasso outperforms it on synthetic CauseMe-style benchmarks, and PCMCI and Granger lead on Lorenz-96. The purported “interventional advantage” of predictive bottlenecks is roughly 60% a sample-size confound. The remainder disappears under standard $do(X=c)$ interventions and survives only under a non-standard random-forcing scheme, where classical bivariate Granger reproduces it with a larger effect. The work delineates the limits of prediction-driven approaches to causal discovery and leaves the benchmark as its lasting artifact.
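One way to read the sample-size point is as a three-arm comparison: observational data only, observational plus interventional data, and a control arm with extra observational data matched in size to the second arm. The sketch below splits the apparent gain into a part explained by sample size and a residual. The function name, the arm names, and the scores are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class GainSplit:
    total: float          # apparent gain from adding interventional data
    from_size: float      # gain reproduced by size-matched observational data
    residual: float       # gain left after controlling for sample size
    size_fraction: float  # share of the total gain explained by sample size


def split_intervention_gain(score_obs: float,
                            score_obs_matched: float,
                            score_obs_plus_int: float) -> GainSplit:
    """Decompose an apparent interventional gain using a size-matched control.

    score_obs          : score with n observational samples
    score_obs_matched  : score with n + m observational samples (control arm)
    score_obs_plus_int : score with n observational + m interventional samples
    """
    total = score_obs_plus_int - score_obs
    if total == 0:
        raise ValueError("no gain to decompose")
    from_size = score_obs_matched - score_obs
    residual = score_obs_plus_int - score_obs_matched
    return GainSplit(total, from_size, residual, from_size / total)


# Made-up scores, for illustration only (not results from the paper).
example = split_intervention_gain(0.60, 0.65, 0.70)
print(example)
```

A `size_fraction` near 1 would mean the extra samples, not the interventions, account for the gain. The residual is the quantity that still needs to be checked under each intervention scheme.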

📝 Abstract

A Mamba state-space model trained only for next-step prediction appears to recover Granger-causal structure through a simple readout $S = |W_{out} W_{in}|$, with early experiments suggesting the phenomenon generalized across architectures and benefited from interventional data at $p < 10^{-5}$. We package the protocol used to test that claim -- standardized synthetic generators (VAR/Lorenz/CauseMe-style), three intervention semantics ($do(X=c)$, soft-noise, random-forcing), edge-provenance cards on three real datasets, and size-matched control arms -- as a reusable falsification benchmark, and walk the claim through it in five stages. The method-level claim does not survive: (i) a plain linear bottleneck does as well or better; (ii) tuned Lasso beats the bottleneck on synthetic CauseMe-style benchmarks, and on Lorenz-96 (the only real benchmark with unambiguous ground truth) classical PCMCI and Granger lead a tight cluster in which the bottleneck trails; (iii) the headline intervention advantage is roughly 60% a sample-size confound, and the residual disappears under standard $do(X=c)$ interventions, surviving only under a non-standard random-forcing scheme; (iv) even that residual reproduces, with a larger effect, in classical bivariate Granger -- the effect is method-agnostic. What survives is a narrow characterization result; the benchmark is the lasting artifact, and each stage above is one of its control arms.
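The readout $S = |W_{out} W_{in}|$ is easiest to see in the plain linear case, where the composed map $W_{out} W_{in}$ is just a fitted transition matrix. The following is a minimal sketch under several assumptions that do not come from the paper: a 3-variable VAR(1) generator, a closed-form least-squares fit factorized by SVD as a stand-in for gradient training, the convention that `S[i, j]` scores the edge from variable `j` to variable `i`, and an arbitrary threshold of 0.2.

```python
import numpy as np


def simulate_var1(A, n_steps, rng, noise_std=1.0):
    """Simulate x[t] = A @ x[t-1] + noise for a stable coefficient matrix A."""
    d = A.shape[0]
    x = np.zeros((n_steps, d))
    for t in range(1, n_steps):
        x[t] = A @ x[t - 1] + noise_std * rng.standard_normal(d)
    return x


def fit_linear_bottleneck(x, k):
    """Fit x[t+1] ~ W_out @ W_in @ x[t] with a hidden width of k.

    Assumption: instead of gradient descent, solve ordinary least squares
    and factorize the result by SVD. With k equal to the number of
    variables this reproduces the least-squares map exactly; for smaller k
    it is only a truncated-SVD approximation.
    """
    X, Y = x[:-1], x[1:]
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # Y ~ X @ B
    M = B.T                                    # x[t+1] ~ M @ x[t]
    U, s, Vt = np.linalg.svd(M)
    root = np.sqrt(s[:k])
    W_out = U[:, :k] * root        # shape (d, k)
    W_in = root[:, None] * Vt[:k]  # shape (k, d)
    return W_in, W_out


def readout(W_in, W_out):
    """S = |W_out W_in|; entry (i, j) scores the edge j -> i."""
    return np.abs(W_out @ W_in)


rng = np.random.default_rng(0)
# Hypothetical ground truth: a chain 0 -> 1 -> 2 with self-dependence.
A_true = np.array([[0.5, 0.0, 0.0],
                   [0.4, 0.5, 0.0],
                   [0.0, 0.4, 0.5]])
x = simulate_var1(A_true, n_steps=2000, rng=rng)
W_in, W_out = fit_linear_bottleneck(x, k=3)
S = readout(W_in, W_out)
edges = S > 0.2  # illustrative threshold, not from the paper
print(np.round(S, 2))
```

In this linear setting the readout reduces to the absolute value of a fitted VAR coefficient matrix, which may help explain why classical regression-based methods are natural baselines for it.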

Problem

Research questions and friction points this paper is trying to address.

causal discovery

prediction bottleneck

Granger causality

interventional data

falsification benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal discovery

prediction bottleneck

falsification benchmark