🤖 AI Summary
Social science research has long been constrained by an inefficient “observe–hypothesize–test” cycle and a lack of automated discovery mechanisms. This work proposes EXPERIGEN, a novel framework that establishes the first end-to-end closed loop for hypothesis generation and validation in the social sciences. By orchestrating a generator–experimenter dual-agent system, EXPERIGEN integrates large language models, Bayesian optimization principles, statistical testing, and real-world A/B experiments, supporting both multimodal and relational data. Empirical evaluations demonstrate that the framework generates 2–4 times more statistically significant hypotheses than baseline methods, with predictive performance improvements of 7%–17%. Expert assessments reveal that 88% of generated hypotheses are novel and 70% possess substantive research value. A/B test results confirm high statistical significance (p < 1e-6) and a large effect size (344%).
📝 Abstract
Data-driven social science research is inherently slow, relying on iterative cycles of observation, hypothesis generation, and experimental validation. While recent data-driven methods promise to accelerate parts of this process, they largely fail to support end-to-end scientific discovery. To address this gap, we introduce EXPERIGEN, an agentic framework that operationalizes end-to-end discovery through a Bayesian-optimization-inspired two-phase search, in which a Generator proposes candidate hypotheses and an Experimenter evaluates them empirically. Across multiple domains, EXPERIGEN consistently discovers 2–4× more statistically significant hypotheses that are 7–17% more predictive than prior approaches, and naturally extends to complex data regimes, including multimodal and relational datasets. Beyond statistical performance, hypotheses must be novel, empirically grounded, and actionable to drive real scientific progress. To evaluate these qualities, we conducted an expert review of machine-generated hypotheses, collecting feedback from senior faculty. Among 25 reviewed hypotheses, 88% were rated moderately or strongly novel, 70% were deemed impactful and worth pursuing, and most demonstrated rigor comparable to senior graduate-level research. Finally, recognizing that ultimate validation requires real-world evidence, we conduct the first A/B test of LLM-generated hypotheses, observing statistically significant results with p < 1e-6 and a large effect size of 344%.
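To make the Generator–Experimenter loop concrete, here is a minimal sketch of the two-phase search idea. All names are hypothetical: a seeded random sampler stands in for the LLM-based Generator, and a permutation test stands in for the Experimenter's statistical validation; this is an illustration of the pattern, not the paper's implementation.

```python
import random
import statistics

def generate_hypotheses(features, n, rng):
    """Generator phase (toy stand-in for an LLM): propose candidate
    hypotheses, each claiming one feature is associated with the outcome."""
    return rng.sample(features, min(n, len(features)))

def permutation_pvalue(x, y, n_perm=2000, rng=None):
    """Experimenter phase: two-sided permutation test on the mean outcome
    difference between rows where binary feature x is True vs False."""
    rng = rng or random.Random(1)

    def mean_diff(xs, ys):
        g1 = [v for f, v in zip(xs, ys) if f]
        g0 = [v for f, v in zip(xs, ys) if not f]
        return statistics.mean(g1) - statistics.mean(g0)

    observed = mean_diff(x, y)
    yp = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(yp)  # break any real association, keep group sizes
        if abs(mean_diff(x, yp)) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

# Toy dataset: "f_signal" truly shifts the outcome; "f_noise" does not.
rng = random.Random(42)
n = 200
data = {
    "f_signal": [rng.random() < 0.5 for _ in range(n)],
    "f_noise":  [rng.random() < 0.5 for _ in range(n)],
}
outcome = [rng.gauss(1.0 if s else 0.0, 0.5) for s in data["f_signal"]]

# Phase 1: propose; Phase 2: keep only hypotheses that survive the test.
candidates = generate_hypotheses(list(data), n=2, rng=random.Random(0))
significant = {
    f: p for f in candidates
    if (p := permutation_pvalue(data[f], outcome, rng=random.Random(1))) < 0.05
}
print(significant)
```

In the full framework the Experimenter's feedback would also steer the next round of generation, in the spirit of Bayesian optimization; this sketch shows only a single propose-then-validate pass.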
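The A/B-test figures (p < 1e-6, 344% effect size) can be read as the output of a standard significance analysis. The sketch below shows a two-proportion z-test, one common way such numbers are computed; the conversion counts are hypothetical and chosen only so the relative lift lands near 344% — they are not the paper's data.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test with pooled standard error,
    plus the relative lift of treatment over control."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    lift = (p_b - p_a) / p_a                    # relative effect size
    return z, p_value, lift

# Hypothetical counts: control converts at 2%, treatment at 8.9%,
# a relative lift of 345% -- roughly the magnitude reported.
z, p, lift = two_proportion_ztest(40, 2000, 178, 2000)
```

With counts of this magnitude, the resulting p-value falls far below the 1e-6 threshold cited in the abstract.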