🤖 AI Summary
Traditional biomedical image analysis typically employs separate models for text generation and region segmentation, leading to fragmented information processing and inflexible deployment. To address this, we propose UniBiomed, the first universal multimodal foundation model for grounded biomedical image interpretation, unifying clinical report generation and anatomical/pathological region segmentation. Our method integrates a Multi-modal Large Language Model (MLLM) with the Segment Anything Model (SAM) in a single synergistic architecture, enabling end-to-end, prompt-free grounded interpretation. We curate a large-scale dataset of over 27 million image–annotation–text triplets covering ten imaging modalities to support multi-task joint training. Evaluated across 84 internal and external datasets, UniBiomed achieves state-of-the-art performance on five core tasks — segmentation, disease recognition, region-aware diagnosis, visual question answering, and report generation — significantly enhancing both clinical analysis efficiency and result consistency.
📝 Abstract
Multi-modal interpretation of biomedical images opens up novel opportunities in biomedical image analysis. Conventional AI approaches typically rely on disjointed training, i.e., Large Language Models (LLMs) for clinical text generation and segmentation models for target extraction, which results in inflexible real-world deployment and a failure to leverage holistic biomedical information. To this end, we introduce UniBiomed, the first universal foundation model for grounded biomedical image interpretation. UniBiomed is built on a novel integration of a Multi-modal Large Language Model (MLLM) and the Segment Anything Model (SAM), which effectively unifies the generation of clinical texts and the segmentation of the corresponding biomedical objects for grounded interpretation. In this way, UniBiomed can tackle a wide range of biomedical tasks across ten diverse imaging modalities. To develop UniBiomed, we curated a large-scale dataset comprising over 27 million triplets of images, annotations, and text descriptions across these ten modalities. Extensive validation on 84 internal and external datasets demonstrates that UniBiomed achieves state-of-the-art performance in segmentation, disease recognition, region-aware diagnosis, visual question answering, and report generation. Moreover, unlike previous models that rely on clinical experts to pre-diagnose images and manually craft precise textual or visual prompts, UniBiomed provides automated, end-to-end grounded interpretation for biomedical image analysis. This represents a paradigm shift in clinical workflows that will significantly improve diagnostic efficiency. In summary, UniBiomed marks a breakthrough in biomedical AI, unlocking powerful grounded interpretation capabilities for more accurate and efficient biomedical image analysis.
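The abstract does not spell out how the MLLM and SAM are coupled. One common pattern in grounded-segmentation models (e.g., the LISA family) is to let the MLLM emit a special segmentation token while generating the report, then project that token's hidden state into SAM's prompt-embedding space so the mask decoder segments the region the text refers to. The sketch below illustrates this pattern with stub components — every name, shape, and token here is an illustrative assumption, not UniBiomed's actual implementation:

```python
import random

random.seed(0)
HIDDEN, PROMPT_DIM, H, W = 8, 4, 16, 16  # toy dimensions (assumptions)

def rand_vec(n):
    return [random.gauss(0, 1) for _ in range(n)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mllm_generate(image_feat):
    # Stub MLLM: emits a report in which each "[SEG]" token
    # carries a hidden state that will be turned into a mask prompt.
    tokens = ["Left", "lung", "opacity", "[SEG]", "."]
    hidden = [rand_vec(HIDDEN) for _ in tokens]
    return tokens, hidden

# Hypothetical learned projection: MLLM hidden space -> SAM prompt space
W_proj = [rand_vec(PROMPT_DIM) for _ in range(HIDDEN)]

def project(h):
    return [sum(h[i] * W_proj[i][j] for i in range(HIDDEN))
            for j in range(PROMPT_DIM)]

def sam_decode(image_feat, prompt):
    # Stub SAM mask decoder: thresholded dot-product between the
    # prompt embedding and per-pixel image features -> binary mask.
    return [[1 if dot(px, prompt) > 0 else 0 for px in row]
            for row in image_feat]

def grounded_interpret(image_feat):
    # End-to-end, prompt-free: the text stream itself triggers segmentation.
    tokens, hidden = mllm_generate(image_feat)
    masks = [sam_decode(image_feat, project(h))
             for t, h in zip(tokens, hidden) if t == "[SEG]"]
    return " ".join(tokens), masks

image_feat = [[rand_vec(PROMPT_DIM) for _ in range(W)] for _ in range(H)]
report, masks = grounded_interpret(image_feat)
print(report, "->", len(masks), "mask(s)")
```

The key design point this illustrates is why no manual prompt is needed: the segmentation prompt is synthesized from the generated text itself, so one forward pass yields both the clinical description and the grounded mask.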