BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery

📅 2024-11-15
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Training AI models for drug discovery, particularly protein language models (pLMs), increasingly relies on large-scale GPU clusters, yet existing frameworks often trade efficiency against usability. Method: The BioNeMo Framework is a high-performance training framework tailored for biochemical AI, enabling scalable pLM development and deployment on clusters of hundreds of GPUs. Built on PyTorch and Megatron-LM, it features a modular architecture supporting flexible integration of optimized data loading, distributed training, mixed-precision arithmetic, sequence parallelism, and high-throughput I/O. Contribution/Results: On 256 NVIDIA A100 GPUs, the framework completes pretraining of a 3-billion-parameter BERT-style pLM on over one trillion tokens in just 4.2 days, substantially lowering the barrier to large-scale pLM training. The framework is open-source, enhancing reproducibility and fostering collaborative innovation in computational biology.
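The headline result implies a concrete cluster throughput. A minimal back-of-the-envelope check, using one trillion tokens as a lower bound (the summary says "over one trillion"):

```python
# Throughput implied by the reported result: >= 1e12 tokens on
# 256 NVIDIA A100 GPUs in 4.2 days (figures taken from the summary above).

TOKENS = 1.0e12          # lower bound on tokens processed
GPUS = 256               # NVIDIA A100 GPUs
DAYS = 4.2               # wall-clock training time

seconds = DAYS * 86_400                    # 362,880 s of training
agg_tokens_per_s = TOKENS / seconds        # aggregate cluster throughput
per_gpu_tokens_per_s = agg_tokens_per_s / GPUS

print(f"aggregate: {agg_tokens_per_s:,.0f} tokens/s")   # ~2,755,732 tokens/s
print(f"per GPU:   {per_gpu_tokens_per_s:,.0f} tokens/s")  # ~10,765 tokens/s
```

So the reported run sustains roughly 2.8 million tokens per second across the cluster, or about 10.8 thousand tokens per second per GPU, as a lower bound.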

📝 Abstract
Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLM) training on hundreds of graphical processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre-training and fine-tuning. On 256 NVIDIA A100s, BioNeMo Framework trains a three billion parameter BERT-based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open-source and free for everyone to use.
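One of the techniques the framework integrates is mixed-precision arithmetic. A standard ingredient of mixed-precision training is keeping an fp32 "master" copy of each weight, because small gradient updates round away entirely in fp16. The sketch below is a generic, framework-free illustration of that effect (it is not BioNeMo code); it uses the `struct` module's half-precision format to emulate fp16 rounding:

```python
import struct

def fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (pure stdlib)."""
    return struct.unpack("e", struct.pack("e", x))[0]

# In fp16 the spacing between representable values near 1.0 is 2**-10
# (~0.00098), so a small gradient step of 1e-4 is rounded away entirely:
lost = fp16(1.0 + 1e-4)
assert lost == 1.0                # the update vanished in fp16

# Keeping a master copy of the weight in fp32 and casting down only when
# needed preserves accumulated updates:
master = 1.0                      # fp32 master weight
for _ in range(10):
    master += 1e-4                # accumulate updates in fp32
assert fp16(master) > 1.0         # ten steps now survive the fp16 cast
```

The same motivation applies to loss scaling and fp32 gradient accumulation in any Megatron-LM-style mixed-precision setup.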
Problem

Research questions and friction points this paper is trying to address.

Training AI models for drug discovery requires large-scale GPU clusters that existing frameworks use inefficiently
Modular training components are hard to integrate into existing computational biology workflows
Efficient pLM training on trillion-token datasets demands specialized infrastructure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular, high-performance library for AI-driven drug discovery
Facilitates large-scale GPU training of biology and chemistry models
Open-source framework with community-contributable components
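Among the listed capabilities is sequence parallelism. The toy sketch below illustrates only the core sharding idea behind it: a long token sequence is split into contiguous shards, one per GPU rank, each rank works on its shard independently, and the results are gathered back in order. The rank count and per-token operation are illustrative stand-ins, not BioNeMo or Megatron-LM APIs; real implementations interleave this with tensor parallelism and communication collectives.

```python
def shard(seq, world_size):
    """Split seq into world_size contiguous, near-equal chunks."""
    n, r = divmod(len(seq), world_size)
    out, start = [], 0
    for rank in range(world_size):
        size = n + (1 if rank < r else 0)  # spread the remainder over early ranks
        out.append(seq[start:start + size])
        start += size
    return out

tokens = list(range(10))          # a 10-token sequence
shards = shard(tokens, 4)         # 4 simulated GPU ranks
assert shards == [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]

# Each rank applies a per-token operation locally (here, squaring stands in
# for token-independent ops like LayerNorm or dropout)...
local = [[t * t for t in s] for s in shards]
# ...then an all-gather reassembles the full sequence in order.
gathered = [x for s in local for x in s]
assert gathered == [t * t for t in tokens]
```

The memory win is that each rank holds activations for only its shard of the sequence, which is what makes trillion-token pretraining runs fit on hundreds of GPUs.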
👥 Authors
Peter St. John (NVIDIA)
Dejun Lin (NVIDIA)
Polina Binder (NVIDIA)
Malcolm Greaves (NVIDIA)
Vega Shah (NVIDIA)
John St. John (NVIDIA)
Adrian Lange (A-Alpha Bio Inc.)
Patrick D. Hsu (Arc Institute | University of California, Berkeley)
Rajesh Illango (Arc Institute)
Arvind Ramanathan (Argonne National Laboratory)
A. Anandkumar (Caltech)
David H. Brookes (Dyno Therapeutics)
A. Busia (Dyno Therapeutics)
Abhishaike Mahajan (Dyno Therapeutics)
Stephen Malina (Dyno Therapeutics)
Neha Prasad (Dyno Therapeutics)
Sam Sinai (Dyno Therapeutics)
Lindsay Edwards (Relation Therapeutics)
Thomas Gaudelet (Relation Therapeutics)
Cristian Regep (Relation Therapeutics)
Martin Steinegger (Seoul National University, Korea)
Burkhard Rost (Technical University of Munich)
Alexander Brace (University of Chicago)
Kyle Hippe (University of Chicago)
Luca Naef (VantAI)
Keisuke Kamata (Weights & Biases)
George Armstrong (NVIDIA)
Kevin Boyd (NVIDIA)
Zhonglin Cao (NVIDIA)
Han-Yi Chou (NVIDIA)
Simon Chu (NVIDIA)
Allan dos Santos Costa (MIT Media Lab)
Sajad Darabi (NVIDIA)
Eric Dawson (NVIDIA)
Kieran Didi (NVIDIA, Oxford University)
Cong Fu (Texas A&M University)
Mario Geiger (MIT)
Michelle Gill (NVIDIA)
Darren J. Hsu (NVIDIA)
Gagan Kaushik (NVIDIA)
Maria Korshunova (NVIDIA)
S. Kothen-Hill (NVIDIA)
Youhan Lee (NVIDIA)
Meng Liu (NVIDIA)
M. Livne (NVIDIA)
Zachary McClure (NVIDIA)
Jonathan Mitchell (NVIDIA)
Alireza Moradzadeh (NVIDIA)
Ohad Mosafi (NVIDIA)
Youssef L. Nashed (NVIDIA)
Saee Paliwal (NVIDIA)
Yuxing Peng (NVIDIA)
Sara Rabhi (NVIDIA)
F. Ramezanghorbani (NVIDIA)
Danny Reidenbach (NVIDIA & UC Berkeley)
Camir Ricketts (NVIDIA)
Brian Roland (NVIDIA)
Kushal Shah (NVIDIA)
Tyler Shimko (NVIDIA)
Hassan Sirelkhatim (NVIDIA)
Savitha Srinivasan (NVIDIA)
Abraham C. Stern (NVIDIA)
Dorota Toczydlowska (NVIDIA)
S. Veccham (NVIDIA)
N. Venanzi (NVIDIA)
Anton Vorontsov (NVIDIA)
Jared Wilber (NVIDIA)
Isabel Wilkinson (NVIDIA)
Wei Jing Wong (NVIDIA)
Eva Xue (NVIDIA)
Cory Ye (NVIDIA)
Xin Yu (NVIDIA)
Yang Zhang (NVIDIA)
Guoqing Zhou (Guilin University of Technology)
Becca Zandstein (NVIDIA)
Christian Dallago (NVIDIA & Duke)
Bruno Trentini (University of Oxford)
E. Kucukbenli (NVIDIA)
Timur Rvachov (NVIDIA)
Eddie Calleja (NVIDIA)
Johnny Israeli (NVIDIA)
Harry Clifford (NVIDIA)
Risto Haukioja (NVIDIA)
Nicholas Haemel (Stanford University)
Kyle Tretina (NVIDIA)
Neha Tadimeti (NVIDIA)
Anthony B. Costa (NVIDIA)