🤖 AI Summary
This work investigates whether large language models (LLMs) can serve as general-purpose scientific agents for feedback-driven Bayesian experimental design, e.g., genetic perturbation and molecular property discovery. Experiments reveal that current instruction-tuned LLMs are insensitive to experimental feedback, fail to adapt their strategies dynamically, and significantly underperform classical methods. To address this, the authors propose LLMNN, a fine-tuning-free hybrid framework that integrates LLMs' prior knowledge with feedback-aware nearest-neighbor sampling for efficient contextual experimental design, bypassing the need for strong in-context adaptation or parameter updates. Evaluated across multiple scientific domains, LLMNN matches or surpasses established baselines, including Gaussian processes and linear bandits, demonstrating for the first time the feasibility and competitiveness of coupling LLMs with lightweight feedback mechanisms in closed-loop scientific discovery.
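The feedback-insensitivity finding rests on a simple control: keep the queried designs but randomly permute their measured outcomes before returning them to the agent, then check whether performance degrades. A minimal sketch of that control follows; the `(design, outcome)` history format is an illustrative assumption, not the paper's actual code.

```python
import random

def permute_feedback(history, seed=0):
    """Control condition: shuffle the measured outcomes across the
    queried designs, destroying any true design-to-outcome signal
    while preserving the marginal distribution of outcomes.

    `history` is assumed to be a list of (design, outcome) pairs.
    """
    rng = random.Random(seed)
    outcomes = [y for _, y in history]
    rng.shuffle(outcomes)
    # Re-pair each design with a randomly drawn outcome.
    return [(x, y) for (x, _), y in zip(history, outcomes)]
```

If an agent truly conditions on feedback, running it on `permute_feedback(history)` should hurt its downstream performance; the paper reports no such gap for the LLM agents tested.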
📝 Abstract
Large language models (LLMs) have recently been proposed as general-purpose agents for experimental design, with claims that they can perform in-context experimental design. We evaluate this hypothesis using both open- and closed-source instruction-tuned LLMs applied to genetic perturbation and molecular property discovery tasks. We find that LLM-based agents show no sensitivity to experimental feedback: replacing true outcomes with randomly permuted labels has no impact on performance. Across benchmarks, classical methods such as linear bandits and Gaussian process optimization consistently outperform LLM agents. We further propose a simple hybrid method, LLM-guided Nearest Neighbour (LLMNN) sampling, that combines LLM prior knowledge with nearest-neighbor sampling to guide the design of experiments. LLMNN achieves competitive or superior performance across domains without requiring significant in-context adaptation. These results suggest that current open- and closed-source LLMs do not perform in-context experimental design in practice and highlight the need for hybrid frameworks that decouple prior-based reasoning from batch acquisition with updated posteriors.
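The LLMNN idea described above can be sketched as a two-phase acquisition rule: before any feedback exists, rank candidates by an LLM-derived prior score; once outcomes arrive, select the unmeasured candidates nearest (in some feature or embedding space) to the best observed design. Everything here, including the function name `llmnn_select` and the use of Euclidean distance over a candidate embedding matrix, is an assumption for illustration, not the authors' implementation.

```python
import numpy as np

def llmnn_select(embeddings, prior_scores, observed, batch_size):
    """Pick the next batch of experiments.

    embeddings   : (n, d) array of candidate feature vectors
    prior_scores : (n,) scores from an LLM prior ranking (stubbed here)
    observed     : dict {candidate index: measured outcome}
    """
    n = embeddings.shape[0]
    candidates = [i for i in range(n) if i not in observed]
    if not observed:
        # Round 1: no feedback yet, so rely on the LLM prior alone.
        return sorted(candidates, key=lambda i: -prior_scores[i])[:batch_size]
    # Feedback-aware step: centre the search on the best observed design
    # and take its nearest unmeasured neighbours in embedding space.
    best = max(observed, key=observed.get)
    dists = np.linalg.norm(embeddings[candidates] - embeddings[best], axis=1)
    order = np.argsort(dists)
    return [candidates[j] for j in order[:batch_size]]
```

This decouples the two roles the abstract identifies: prior-based reasoning (the LLM ranking) seeds the search, while batch acquisition updates purely from observed outcomes, so no in-context adaptation by the LLM is required.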