Many Needles in a Haystack: Active Hit Discovery for Perturbation Experiments

📅 2026-05-11

🤖 AI Summary
This study addresses the challenge of efficiently identifying multiple gene perturbations whose phenotypic effects exceed a predefined threshold (referred to as “hits”) under limited experimental budgets. Framing the problem as a sequential experimental design task, the work introduces Probability-of-Hit, an acquisition function tailored to multi-hit discovery that ranks candidate perturbations by their posterior probability of being hits, directly targeting threshold exceedance rather than a single global optimum. Built within a Bayesian optimization framework, the approach combines Gaussian process modeling with the proposed acquisition function to explore high-value regions of the search space, and the authors prove that it is asymptotically optimal. Empirical evaluations on synthetic benchmarks and real immunology datasets show strong performance, including up to a 6.4% improvement over baselines on the Schmidt IL-2 dataset.
📝 Abstract
High-throughput gene perturbation experiments can test several genetic interventions in parallel, yet experimental budgets remain limited. A central goal is hit discovery: identifying as many perturbations as possible whose phenotypic effect exceeds a predefined threshold. Pure exploration strategies are statistically inefficient, wasting budget on low-value regions. Bayesian optimization methods offer a principled alternative but target a single global optimum, over-exploiting dominant modes while neglecting other high-value regions. We formalize hit discovery as a sequential experimental design problem and propose Probability-of-Hit, an acquisition function that directly targets threshold exceedance by ranking candidates according to their posterior probability of being a hit. We prove asymptotic optimality of this approach and demonstrate strong empirical performance on both synthetic benchmarks and real biological immunology datasets, including up to 6.4% improvement over baselines on the Schmidt IL-2 dataset.
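The ranking rule described in the abstract can be illustrated with a minimal sketch. It assumes each candidate's posterior over its phenotypic effect is Gaussian with a known mean and standard deviation (for example, as produced by a Gaussian process); the function names `probability_of_hit` and `select_batch`, the `tested` argument, and the example numbers are hypothetical and are not taken from the paper.

```python
import math

def probability_of_hit(mean, std, threshold):
    """P(f > threshold) when the posterior of f is N(mean, std^2) (assumed Gaussian)."""
    if std <= 0.0:
        # Degenerate posterior: the effect is treated as known exactly.
        return 1.0 if mean > threshold else 0.0
    z = (mean - threshold) / std
    # Standard normal CDF evaluated at z, written via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def select_batch(means, stds, threshold, batch_size, tested=()):
    """Return indices of the untested candidates with the highest hit probability."""
    tested = set(tested)
    scored = [
        (probability_of_hit(m, s, threshold), i)
        for i, (m, s) in enumerate(zip(means, stds))
        if i not in tested
    ]
    # Sort by descending probability; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:batch_size]]

# Illustrative posterior summaries for four candidate perturbations.
means = [0.2, 1.5, 0.9, 1.1]
stds = [0.1, 0.3, 1.0, 0.05]
batch = select_batch(means, stds, threshold=1.0, batch_size=2)
print(batch)  # [3, 1]
```

In this toy example, candidate 3 is ranked ahead of candidate 1 even though its posterior mean is lower, because its small uncertainty makes exceeding the threshold more probable. This is the sense in which ranking by hit probability differs from chasing the single largest predicted effect.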
Problem

Research questions and friction points this paper is trying to address.

hit discovery
gene perturbation
sequential experimental design
threshold exceedance
high-throughput screening
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probability-of-Hit
hit discovery
Bayesian optimization
sequential experimental design
gene perturbation
Authors
Andrea Rubbi
Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK; Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
Arpit Merchant
Wellcome Sanger Institute
Machine Learning, Graph Neural Networks, Algorithmic Ethics
Samuel Ogden
Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
Amir Akbarnejad
Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK; Cambridge Center for AI in Medicine, University of Cambridge, Cambridge, UK
Pietro Liò
Professor, University of Cambridge
AI & Comp Biology -> Medicine
Sattar Vakili
MediaTek Research
Machine Learning
Mo Lotfollahi
Wellcome Sanger Institute, University of Cambridge
Computational biology, Machine learning, Drug discovery