RxRx3-core: Benchmarking drug-target interactions in High-Content Microscopy

📅 2025-03-26
🤖 AI Summary
High-content screening (HCS) has lacked accessible, publicly available datasets and standardized benchmarks oriented toward representation learning, which has slowed progress in zero-shot drug-target interaction (DTI) inference. To address this, we introduce RxRx3-core, a curated and compressed subset of the RxRx3 dataset (18 GB, 222,601 images) that captures cellular responses to 736 CRISPR knockouts and 1,674 compounds at eight concentrations, together with an associated DTI benchmarking task. We provide pre-trained embeddings and benchmarking code. RxRx3-core is publicly released on Hugging Face and Polaris, lowering the barrier to HCS modeling and supporting reproducible evaluation of representation learning methods and biological discovery.

📝 Abstract
High Content Screening (HCS) microscopy datasets have transformed the ability to profile cellular responses to genetic and chemical perturbations, enabling cell-based inference of drug-target interactions (DTI). However, the adoption of representation learning methods for HCS data has been hindered by the lack of accessible datasets and robust benchmarks. To address this gap, we present RxRx3-core, a curated and compressed subset of the RxRx3 dataset, and an associated DTI benchmarking task. At just 18GB, RxRx3-core significantly reduces the size barrier associated with large-scale HCS datasets while preserving critical data necessary for benchmarking representation learning models against a zero-shot DTI prediction task. RxRx3-core includes 222,601 microscopy images spanning 736 CRISPR knockouts and 1,674 compounds at 8 concentrations. RxRx3-core is available on HuggingFace and Polaris, along with pre-trained embeddings and benchmarking code, ensuring accessibility for the research community. By providing a compact dataset and robust benchmarks, we aim to accelerate innovation in representation learning methods for HCS data and support the discovery of novel biological insights.
Problem

Research questions and friction points this paper is trying to address.

Lack of accessible datasets for HCS representation learning
Need for robust benchmarks in drug-target interaction prediction
High storage demands hinder large-scale HCS data utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curated subset RxRx3-core for DTI benchmarking
Compressed 18GB dataset with 222,601 images
Includes pre-trained embeddings and benchmarking code
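
To make the zero-shot DTI task mentioned in the abstract more concrete, here is a minimal sketch of how compound and gene-knockout embeddings might be compared. It is not the benchmark's actual code. It assumes that each perturbation is represented by the mean of its well-level embeddings and that a higher cosine similarity between a compound and a knockout suggests a more likely interaction. All function names, labels (`GENE_A`, `CMPD_1`), and the toy 4-dimensional vectors are hypothetical.

```python
import numpy as np


def mean_embedding_per_label(embeddings, labels):
    """Average the well-level embeddings that share a perturbation label."""
    unique = sorted(set(labels))
    label_array = np.asarray(labels)
    means = np.stack(
        [embeddings[label_array == name].mean(axis=0) for name in unique]
    )
    return unique, means


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def rank_genes_for_compounds(compound_emb, gene_emb, gene_names):
    """For each compound, list gene names from most to least similar.

    Assumption: similarity of phenotypic embeddings is used as the
    zero-shot interaction score; no model is trained on DTI labels.
    """
    sims = cosine_similarity_matrix(compound_emb, gene_emb)
    order = np.argsort(-sims, axis=1)
    return [[gene_names[j] for j in row] for row in order]


# Toy stand-ins for pre-computed embeddings (two wells per perturbation).
gene_wells = np.array([
    [1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0],   # GENE_A
    [0.0, 1.0, 0.0, 0.0], [0.0, 0.9, 0.1, 0.0],   # GENE_B
    [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.9, 0.1],   # GENE_C
])
gene_labels = ["GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_C", "GENE_C"]

compound_wells = np.array([
    [0.1, 1.0, 0.0, 0.0], [0.0, 0.8, 0.1, 0.0],   # CMPD_1
])
compound_labels = ["CMPD_1", "CMPD_1"]

gene_names, gene_emb = mean_embedding_per_label(gene_wells, gene_labels)
compound_names, compound_emb = mean_embedding_per_label(
    compound_wells, compound_labels
)
rankings = rank_genes_for_compounds(compound_emb, gene_emb, gene_names)

for name, ranked in zip(compound_names, rankings):
    print(name, ranked)
```

In a real evaluation, the ranked scores would be compared against known compound-gene annotations with a ranking metric; which metric and which annotation source to use are left open here, since the listing above does not specify them.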