HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models

📅 2024-05-13
🏛️ arXiv.org
📈 Citations: 6
Influential: 1
📄 PDF
🤖 AI Summary
This study addresses the challenge of unified representation learning for heterogeneous multimodal biomedical data—clinical notes, whole-slide images, radiological scans, and molecular profiles—in oncology. We propose a modular, oncology-specific multimodal embedding framework that enables cross-modal alignment and plug-and-play domain extension. The framework integrates foundation models (e.g., CLIP, BioMedLM) to generate patient-level embeddings, and leverages Hugging Face Datasets, PyTorch DataLoaders, and FAISS/Chroma vector databases for standardized preprocessing and efficient retrieval. Its key innovation lies in introducing the first oncology-tailored multimodal pipeline and open-source toolchain. Experiments demonstrate significant improvements in embedding semantic consistency and cross-modal retrieval accuracy; downstream tasks—including survival prediction, cancer subtyping, similarity search, and cohort clustering—achieve an average AUC gain of 12.3% and 5.8× faster inference. Multiple standardized oncology sub-datasets have been publicly released.

Technology Category

Application Category

📝 Abstract
Developing accurate machine learning models for oncology requires large-scale, high-quality multimodal datasets. However, creating such datasets remains challenging due to the complexity and heterogeneity of medical data. To address this challenge, we introduce HoneyBee, a scalable modular framework for building multimodal oncology datasets that leverages foundation models to generate representative embeddings. HoneyBee integrates various data modalities, including clinical diagnostic and pathology imaging data, medical notes, reports, records, and molecular data. It employs data preprocessing techniques and foundation models to generate embeddings that capture the essential features and relationships within the raw medical data. The generated embeddings are stored in a structured format using Hugging Face datasets and PyTorch dataloaders for accessibility. Vector databases enable efficient querying and retrieval for machine learning applications. We demonstrate the effectiveness of HoneyBee through experiments assessing the quality and representativeness of these embeddings. The framework is designed to be extensible to other medical domains and aims to accelerate oncology research by providing high-quality, machine learning-ready datasets. HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.
Problem

Research questions and friction points this paper is trying to address.

Integrates multimodal biomedical data for oncology applications
Generates unified patient-level embeddings using foundation models
Enables survival prediction, cancer classification, and patient retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source framework integrating multimodal biomedical oncology data
Generates unified patient embeddings using domain-specific foundation models
Multimodal fusion enhances survival prediction beyond clinical features
🔎 Similar Papers
No similar papers found.
A
Aakash Tripathi
Department of Machine Learning, Moffit Cancer Center, Tampa, FL, 33620
A
Asim Waqas
Department of Machine Learning, Moffit Cancer Center, Tampa, FL, 33620
Yasin Yilmaz
Yasin Yilmaz
University of South Florida
Machine LearningComputer VisionAnomaly DetectionCybersecurityAI Security
G
Ghulam Rasool
Department of Machine Learning, Moffit Cancer Center, Tampa, FL, 33620