🤖 AI Summary
This work addresses the underdetermination of scientific reasoning, where multiple mechanistically distinct hypotheses can explain the same observations, by introducing HypoSpace, the first systematic benchmark for evaluating large language models' (LLMs') ability to generate diverse, valid, and comprehensive sets of scientific hypotheses. Methodologically, it establishes three deterministically validatable domains (causal graph inference, gravity-constrained 3D voxel reconstruction, and Boolean gene-interaction prediction) and proposes a tripartite evaluation framework measuring Validity, Uniqueness, and Recovery, thereby explicitly diagnosing implicit mode collapse that conventional correctness-only metrics cannot detect. Experiments reveal that as the space of admissible hypotheses expands, LLMs exhibit significant degradation in Uniqueness and Recovery even while Validity remains high, exposing a fundamental diversity bottleneck; instruction tuning and reasoning-augmentation strategies yield only marginal improvements. This work provides the first quantifiable, diagnostic evaluation framework for assessing scientific creativity in LLMs.
📝 Abstract
As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations, not just a single correct answer, becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe, rather than a leaderboard, for methods that explicitly explore and cover admissible explanation spaces. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.
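The three indicators admit a simple set-based reading. The sketch below is an illustration of that reading, not the paper's reference implementation: it assumes each hypothesis can be canonicalized to a hashable value and checked against a fully enumerated admissible set, and the exact normalizations (e.g. whether Uniqueness is computed over all proposals or only valid ones) are assumptions.

```python
def evaluate(proposals, admissible):
    """Illustrative Validity / Uniqueness / Recovery scores.

    proposals:  list of canonicalized hypotheses a model emitted
                (may contain repeats and invalid entries).
    admissible: set of all enumerated hypotheses consistent with
                the observations (the ground-truth admissible set).
    """
    n = len(proposals)
    valid = [h for h in proposals if h in admissible]

    # Validity: precision of proposals consistent with observations.
    validity = len(valid) / n if n else 0.0
    # Uniqueness: non-redundancy among proposals (assumed: distinct / total).
    uniqueness = len(set(proposals)) / n if n else 0.0
    # Recovery: fraction of the admissible set actually covered.
    recovery = len(set(valid)) / len(admissible)
    return validity, uniqueness, recovery


# Toy example: 4 admissible hypotheses; the model proposes 5 with repeats
# and one invalid entry, mimicking the mode collapse the suite diagnoses.
admissible = {"A->B", "B->A", "A<->B", "no-edge"}
proposals = ["A->B", "A->B", "B->A", "B->A", "C->A"]
v, u, r = evaluate(proposals, admissible)
# v = 0.8 (4/5 valid), u = 0.6 (3 distinct of 5), r = 0.5 (2 of 4 recovered)
```

The example shows why correctness-only metrics miss the failure: Validity looks strong (0.8) while Recovery reveals that half the admissible space was never proposed.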