🤖 AI Summary
Traditional causal discovery methods rely on structured data, suffering from high collection costs, strong assumptions, and limited capacity to uncover novel causal relationships; while existing LLM-based approaches can identify known causal relations, they lack genuine causal discovery capability. Method: We propose the first end-to-end, dataset-free causal discovery framework that iteratively retrieves domain documents and extracts variables, synergistically integrating statistical causal discovery algorithms with large language models. It jointly discovers both known and novel causal relations and introduces a novel missing-variable recommendation mechanism to dynamically expand the causal graph. Contribution/Results: The framework enables real-time, verifiable causal discovery and latent variable completion directly from unstructured text. Experiments demonstrate significant improvements in coverage, novelty, and interpretability of discovered causal relations compared to state-of-the-art baselines.
📝 Abstract
Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant computation for known relations, and unrealistic assumptions. While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover known and novel causal relations. In addition to causal discovery on initial variables, the missing variable proposal component of IRIS identifies and incorporates missing variables to expand the causal graphs. Our approach enables real-time causal discovery from only a set of initial variables without requiring pre-existing datasets.