🤖 AI Summary
This study asks whether causal effects can be predicted accurately from natural language queries, as a cost-effective alternative to expensive and time-consuming randomized controlled trials. The authors introduce Query2Effect, a new large-scale benchmark that aligns more than 72,000 natural language questions with experiment descriptions, and propose a two-stage framework: a query is first parsed into a structured semantic representation, and a supervised encoder then predicts the causal effect from that representation. Finetuning reduces absolute prediction error by 27% to 71% across multiple domains relative to off-the-shelf large language models prompted directly, and the two-stage design improves out-of-domain generalization.
📝 Abstract
Randomized controlled trials are a cornerstone of medicine and the social sciences because they enable reliable estimates of causal effects. However, they are costly and time-consuming to conduct, which motivates interest in predicting causal effects from existing experimental evidence. Recent advances in large language models (LLMs) have demonstrated strong performance on knowledge-intensive tasks, raising the question of whether these models can be used to forecast causal effect sizes. To investigate this, we introduce Query2Effect, a new large-scale benchmark consisting of more than 72,000 natural language questions aligned with experiment descriptions. The benchmark is designed to simulate realistic information-seeking scenarios by varying query specificity along three dimensions: implicitness, abstraction, and ambiguity. We then propose a two-step framework that first generates a synthetic structured representation of a query and then predicts the effect size using a supervised encoder model. Experiments show that finetuning plays a crucial role in improving prediction performance, reducing absolute error by 27% to 71% compared to prompted out-of-the-box LLMs. They also show that our two-step framework benefits out-of-domain generalization, highlighting the value of separating semantic interpretation from numerical effect estimation.
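The separation between interpretation and estimation, and the way an error reduction of this kind might be computed, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `StructuredQuery` fields, the function names, and the reading of the reported percentages as relative reductions in mean absolute error are all assumptions, since the abstract specifies none of them.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class StructuredQuery:
    # Hypothetical schema: the abstract does not say which fields
    # the structured representation contains.
    intervention: str
    outcome: str
    population: str


def predict_effect(
    query: str,
    parse: Callable[[str], StructuredQuery],
    estimate: Callable[[StructuredQuery], float],
) -> float:
    """Run the two steps in sequence: interpret the query, then estimate."""
    structured = parse(query)    # step 1: semantic interpretation
    return estimate(structured)  # step 2: numerical effect estimation


def mean_absolute_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Average of |prediction - ground truth| over all examples."""
    if len(pred) != len(true) or len(true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def relative_error_reduction(baseline_mae: float, model_mae: float) -> float:
    """Fraction by which model_mae is lower than baseline_mae (assumed metric)."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return (baseline_mae - model_mae) / baseline_mae
```

Passing the parser and the estimator as separate callables mirrors the abstract's point that the two concerns can be developed and evaluated independently. In practice `parse` would presumably wrap a generative model and `estimate` a finetuned encoder.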