🤖 AI Summary
This work addresses the challenge of directly training models for scientific hypothesis generation, formalized as $P(\text{hypothesis}|\text{background})$, which is hindered by combinatorial complexity scaling as $O(N^k)$. To overcome this, we propose the MOOSE-Star framework, which combines probabilistic decomposition of the discovery equation, motivation-guided hierarchical search, and bounded composition to reduce complexity from exponential to logarithmic. This enables, for the first time, end-to-end trainable modeling of the hypothesis generation process. Evaluated on TOMATO-Star, a large-scale dataset of 108,717 decomposed scientific papers, MOOSE-Star breaks through the “complexity wall” inherent in conventional brute-force sampling, enabling efficient, scalable generation of scientific hypotheses with continuous test-time scaling.
📝 Abstract
While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedback-driven training, leaving the direct modeling of the generative reasoning process, $P(\text{hypothesis}|\text{background})$ ($P(h|b)$), unexplored. We demonstrate that directly training $P(h|b)$ is mathematically intractable due to the combinatorial complexity ($O(N^k)$) inherent in retrieving and composing inspirations from a vast knowledge base. To break this barrier, we introduce MOOSE-Star, a unified framework enabling tractable training and scalable inference. In the best case, MOOSE-Star reduces complexity from exponential to logarithmic ($O(\log N)$) by (1) training on decomposed subtasks derived from the probabilistic equation of discovery, (2) employing motivation-guided hierarchical search to enable logarithmic retrieval and prune irrelevant subspaces, and (3) utilizing bounded composition for robustness against retrieval noise. To facilitate this, we release TOMATO-Star, a dataset of 108,717 decomposed papers (38,400 GPU hours) for training. Furthermore, we show that while brute-force sampling hits a “complexity wall,” MOOSE-Star exhibits continuous test-time scaling.
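To give intuition for why hierarchical search turns an $O(N)$ scan into an $O(\log N)$ retrieval, here is a minimal, hypothetical sketch (not the paper's implementation): documents are represented as scalar "embeddings," internal tree nodes store their subtree's centroid, and retrieval greedily descends toward the child whose centroid best matches the query motivation, scoring only two candidates per level.

```python
import math

# Toy sketch (hypothetical, not MOOSE-Star's actual code): hierarchical
# retrieval over N documents embedded as scalars. A flat scan would score
# all N leaves; greedy descent scores one pair of children per level,
# i.e. about log2(N) comparisons.

def build(docs):
    """Build a balanced binary tree over sorted scalar 'embeddings'."""
    if len(docs) == 1:
        return {"centroid": docs[0], "doc": docs[0]}
    mid = len(docs) // 2
    return {
        "centroid": sum(docs) / len(docs),
        "children": (build(docs[:mid]), build(docs[mid:])),
    }

def retrieve(node, query):
    """Descend the tree, picking the closer-centroid child at each level."""
    comparisons = 0
    while "children" in node:
        left, right = node["children"]
        comparisons += 1
        if abs(left["centroid"] - query) <= abs(right["centroid"] - query):
            node = left
        else:
            node = right
    return node["doc"], comparisons

N = 1024
tree = build(list(range(N)))
doc, comparisons = retrieve(tree, 700.3)
print(doc, comparisons)  # nearest document found in log2(N) comparisons
assert comparisons == int(math.log2(N))
```

The same pruning idea underlies the framework's claim: each level of the hierarchy discards the irrelevant half of the knowledge base, so retrieval cost grows with the tree depth rather than with the corpus size.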