🤖 AI Summary
This work addresses the longstanding absence of a unified, secure, and collaborative data management platform in health and bioinformatics, which has hindered efficient sharing and governance of heterogeneous data. The authors propose and implement a health data management platform based on a lakehouse architecture integrated with data federation, systematically embedding the FAIR (Findable, Accessible, Interoperable, Reusable) principles into this framework for the first time. Built on an open-source toolchain, the platform is a scalable solution that can be deployed either self-hosted or in the cloud. Users can interact with it through a web interface, RESTful APIs, and Python/R clients. User studies demonstrate its usability for researchers with diverse technical backgrounds, while multi-organizational deployments confirm its flexibility and reproducibility.
📝 Abstract
Data management can be a complex challenge in fields such as bioinformatics and health sciences, which continuously generate large, heterogeneous datasets. In collaborative global health initiatives, secure storage and sharing of data are crucial to support impactful research. However, the absence of a unified data management platform complicates efficient data exchange and governance within these initiatives. In this paper, we describe the design process of OpenHealth Lake, a prototype data management platform based on a data lakehouse architecture, data federation, and the FAIR principles. The platform is built with open-source tools, guided by system requirements identified in previously published studies and complemented by insights from the existing literature. The current prototype comprises a user-friendly website, an open API, and Python and R packages, allowing users to interact with the platform in multiple ways. Through a user study that included participants with varying technical backgrounds, we show that the prototype is both usable and useful. Our design showcases the adaptability, scalability, and reproducibility of a lakehouse system that can be used by any organisation. It is intended as a flexible and complementary approach that allows organisations to customise data management systems to their specific requirements and resources, including a choice between cloud-based and self-hosted storage.