🤖 AI Summary
This study addresses key challenges that hinder the clinical deployment of generative AI in healthcare: data fragmentation across institutions (data silos), modality incompatibility, and poor integration into clinical workflows. We propose a data-centric paradigm for healthcare generative AI, built on a sustainable, multimodal medical data ecosystem. The framework combines semantic vector search and context-aware querying with large-scale pretraining and domain-specific fine-tuning, so that a single ecosystem supports pretraining, instruction tuning, and agent-level reasoning. Crucially, we treat the data ecosystem itself as the foundational infrastructure for generative AI, enabling end-to-end pipelines for clinical note generation, diagnostic assistance, personalized treatment recommendation, and clinical decision support. Experimental results demonstrate significant reductions in clinicians’ cognitive load, improved diagnostic accuracy, and enhanced treatment plan appropriateness. The approach provides both a methodological framework and scalable infrastructure for deploying trustworthy, production-ready healthcare foundation models.
📝 Abstract
Generative Artificial Intelligence (GenAI) is taking the world by storm. It promises transformative opportunities for advancing and disrupting existing practices, including those in healthcare. From large language models (LLMs) for clinical note synthesis and conversational assistance to multimodal systems that integrate medical imaging, electronic health records, and genomic data for decision support, GenAI is transforming the practice of medicine and the delivery of healthcare, including diagnosis and personalized treatment. It has great potential to reduce the cognitive burden on clinicians, thereby improving overall healthcare delivery. However, deploying GenAI in healthcare requires an in-depth understanding of healthcare tasks and of what can and cannot be achieved. In this paper, we propose a data-centric paradigm for the design and deployment of GenAI systems for healthcare. Specifically, we reposition the data life cycle by making the medical data ecosystem the foundational substrate for generative healthcare systems. This ecosystem is designed to sustainably support the integration, representation, and retrieval of diverse medical data and knowledge. With effective and efficient data processing pipelines, such as semantic vector search and contextual querying, it enables GenAI-powered operations for upstream model components and downstream clinical applications. Ultimately, it not only supplies foundation models with high-quality, multimodal data for large-scale pretraining and domain-specific fine-tuning, but also serves as a knowledge retrieval backend to support task-specific inference via the agentic layer. The ecosystem thus enables the deployment of GenAI for high-quality and effective healthcare delivery.
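
To make the retrieval-backend idea concrete, the following is a minimal sketch of semantic vector search combined with a contextual filter. It is not the paper's implementation. The `embed` function is a toy bag-of-words stand-in for a learned encoder, and the record fields (`patient`, `modality`, `text`), the sample records, and the `search` helper are all hypothetical names chosen for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a learned encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical multimodal records, already reduced to text for this sketch.
RECORDS = [
    {"patient": "p1", "modality": "note",
     "text": "chest pain and shortness of breath on exertion"},
    {"patient": "p1", "modality": "imaging",
     "text": "chest x-ray shows no acute abnormality"},
    {"patient": "p2", "modality": "note",
     "text": "chest pain radiating to left arm"},
    {"patient": "p1", "modality": "lab",
     "text": "troponin within normal range"},
]

def search(query, patient=None, top_k=2):
    """Rank records by similarity to the query.

    The optional `patient` argument plays the role of query context:
    it restricts the candidate set before similarity ranking.
    """
    q = embed(query)
    candidates = [r for r in RECORDS
                  if patient is None or r["patient"] == patient]
    ranked = sorted(candidates,
                    key=lambda r: cosine(q, embed(r["text"])),
                    reverse=True)
    return ranked[:top_k]

# Context-aware query: only records belonging to patient p1 are considered.
results = search("chest pain", patient="p1")
for r in results:
    print(r["modality"], "|", r["text"])
```

In a realistic system the toy encoder would be replaced by a trained embedding model and the linear scan by an approximate nearest-neighbor index. The sketch only shows how a contextual filter and similarity ranking can compose into a single retrieval call that an agentic layer might invoke.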