Why Pool When You Can Flow? Active Learning with GFlowNets

2025-08-31
AI Summary
Traditional pool-based active learning methods (e.g., BALD) face severe computational bottlenecks in ultra-large-scale virtual screening (billions of molecules), because they require exhaustive evaluation of the entire unlabeled pool, which makes them inefficient and hard to scale. To address this, the authors propose BALD-GFlowNet, an active learning framework that uses generative flow networks (GFlowNets) to sample molecules directly, with probability proportional to their BALD information gain, so that no pool-wide scoring is needed. The cost of acquisition is therefore independent of pool size. Empirically, BALD-GFlowNet achieves screening performance on par with standard BALD while selecting more structurally diverse compounds. The virtual screening experiments for drug discovery indicate that the approach is efficient and scalable, pointing to a promising direction for active learning over billion-scale libraries.

Abstract
The scalability of pool-based active learning is limited by the computational cost of evaluating large unlabeled datasets, a challenge that is particularly acute in virtual screening for drug discovery. While active learning strategies such as Bayesian Active Learning by Disagreement (BALD) prioritize informative samples, they remain computationally intensive when scaled to libraries containing billions of samples. In this work, we introduce BALD-GFlowNet, a generative active learning framework that circumvents this issue. Our method leverages Generative Flow Networks (GFlowNets) to directly sample objects in proportion to the BALD reward. By replacing traditional pool-based acquisition with generative sampling, BALD-GFlowNet achieves scalability that is independent of the size of the unlabeled pool. In our virtual screening experiment, we show that BALD-GFlowNet achieves performance comparable to that of a standard BALD baseline while generating more structurally diverse molecules, offering a promising direction for efficient and scalable molecular discovery.
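To make the BALD reward concrete, here is a minimal sketch of the standard BALD score for a binary property predictor: the entropy of the averaged prediction minus the average entropy of the individual predictions. It assumes an ensemble (or a set of posterior samples) that outputs per-candidate probabilities; the names `binary_entropy`, `bald_score`, and `member_probs` are illustrative and do not come from the paper, which may estimate the score differently.

```python
import numpy as np


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def bald_score(member_probs):
    """BALD mutual information estimate per candidate.

    member_probs: array of shape (n_members, n_candidates); entry [k, i] is
    P(y = 1 | x_i, theta_k) from ensemble member k (an assumed setup).
    """
    member_probs = np.asarray(member_probs, dtype=float)
    mean_prob = member_probs.mean(axis=0)
    predictive_entropy = binary_entropy(mean_prob)                # H[E_theta p]
    expected_entropy = binary_entropy(member_probs).mean(axis=0)  # E_theta H[p]
    return predictive_entropy - expected_entropy


# Candidate 0: members agree (low score). Candidate 1: members disagree (high score).
scores = bald_score([[0.9, 0.01],
                     [0.9, 0.99]])
```

In pool-based acquisition this function would be evaluated on every molecule in the library before ranking; the generative approach described above instead only needs it for the objects the sampler actually produces.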
Problem

Research questions and friction points this paper is trying to address.

Scalable active learning for large unlabeled datasets
Computational efficiency in virtual drug screening
Generative sampling to replace pool-based acquisition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Flow Networks for active learning
Direct sampling proportional to BALD reward
Scalability independent of unlabeled pool size
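The idea of sampling in proportion to a reward can be illustrated with the trajectory balance objective, one commonly used GFlowNet training loss. The summary does not say which objective BALD-GFlowNet uses, so treat this purely as an assumption-laden sketch: `trajectory_balance_loss`, its arguments, and the `eps` offset (added because a BALD reward can be close to zero) are all hypothetical.

```python
import math


def trajectory_balance_loss(log_z, log_pf_steps, log_pb_steps, reward, eps=1e-8):
    """Squared trajectory balance residual for one construction trajectory.

    log_z:         learned log partition function (scalar)
    log_pf_steps:  log forward-policy probabilities of each construction step
    log_pb_steps:  log backward-policy probabilities of the same steps
    reward:        non-negative reward of the finished object, e.g. its BALD score
    eps:           small offset so log(reward) stays finite (an assumption)
    """
    forward = log_z + sum(log_pf_steps)
    backward = math.log(reward + eps) + sum(log_pb_steps)
    return (forward - backward) ** 2


# Toy check: total reward mass Z = 4, the object is built in two steps with
# forward probability 0.5 each, and it has exactly one parent at each step.
consistent = trajectory_balance_loss(
    log_z=math.log(4.0),
    log_pf_steps=[math.log(0.5), math.log(0.5)],
    log_pb_steps=[0.0, 0.0],
    reward=1.0,
)
# Same trajectory, but the object actually carries reward 2: the residual is nonzero.
inconsistent = trajectory_balance_loss(
    log_z=math.log(4.0),
    log_pf_steps=[math.log(0.5), math.log(0.5)],
    log_pb_steps=[0.0, 0.0],
    reward=2.0,
)
```

When this residual is zero for all trajectories, the forward policy generates each finished object with probability equal to its reward divided by the total reward mass. The loss is computed per sampled trajectory, so training cost depends on the number of generated objects rather than on the size of an external library.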
Renfei Zhang
School of Computer Science, Simon Fraser University, Burnaby, BC, Canada
Mohit Pandey
Vancouver Prostate Centre, University of British Columbia, Vancouver, BC, Canada
Artem Cherkasov
University of British Columbia
Androgen Receptor, Prostate Cancer, Cheminformatics, Computer-Aided Drug Design, QSAR
Martin Ester
School of Computer Science, Simon Fraser University, Burnaby, BC, Canada