🤖 AI Summary
Problem: Addressing the dual challenges of GDPR compliance and patient-privacy preservation in rare-disease settings, where clinical data is scarce.
Method: We propose a privacy-by-design clinical AI modeling framework that integrates a synthetic clinical knowledge graph (cKG) for structure-preserving initial modeling and leverages the FeatureCloud federated learning platform to enable secure, on-site model training and evaluation within hospital-controlled environments. A multi-stage security protocol ensures that raw data never leaves the premises: cKG-based pre-modeling with content-level de-identification, end-to-end data isolation within sandboxed execution environments, automated security pipelines, and aggregation-only metric evaluation. The framework supports secure, collaborative analysis of multi-omics and heterogeneous clinical data.
Contribution/Results: At TUM.ai Makeathon 2024, 50 participants successfully developed patient classification and diagnostic models without accessing any real patient data, demonstrating the framework’s feasibility, regulatory compliance, and efficiency in privacy-constrained collaborative AI development.
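The structure-preserving pre-modeling idea above can be illustrated with a minimal sketch: keep the topology and node/edge types of a (hypothetical) real cKG, but strip all content-level attributes before sharing. All node names, attributes, and the `synthesize` helper below are illustrative assumptions, not the paper's actual data model.

```python
# Minimal sketch: derive a structure-preserving synthetic graph from a
# hypothetical "real" clinical knowledge graph. The edge structure and
# node/edge types are kept; all content values (identifiers, diagnoses,
# variants, doses) are dropped, so the synthetic graph can be shared for
# model design without exposing patient-level information.

# Hypothetical real cKG: node -> attributes, plus typed edges.
real_nodes = {
    "patient_007": {"type": "Patient", "diagnosis": "rare_disease_X"},
    "gene_BRCA2": {"type": "Gene", "variant": "c.5946delT"},
    "drug_A": {"type": "Drug", "dose": "50mg"},
}
real_edges = [
    ("patient_007", "HAS_VARIANT", "gene_BRCA2"),
    ("patient_007", "TREATED_WITH", "drug_A"),
]

def synthesize(nodes, edges):
    """Keep graph structure and type labels; discard all content values."""
    # Deterministic anonymized renaming of every node.
    mapping = {name: f"node_{i}" for i, name in enumerate(sorted(nodes))}
    syn_nodes = {
        mapping[n]: {"type": attrs["type"]}  # type survives, content does not
        for n, attrs in nodes.items()
    }
    syn_edges = [(mapping[s], rel, mapping[t]) for s, rel, t in edges]
    return syn_nodes, syn_edges

syn_nodes, syn_edges = synthesize(real_nodes, real_edges)
print(syn_nodes)
print(syn_edges)
```

A model designed against `syn_nodes`/`syn_edges` sees the same schema and connectivity as the real cKG, which is what makes the later on-site training step a drop-in swap.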
📝 Abstract
The integration of clinical data offers significant potential for the development of personalized medicine. However, its use is severely restricted by the General Data Protection Regulation (GDPR), especially for small cohorts with rare diseases. High-quality, structured data is essential for the development of predictive medical AI. In this case study, we propose a novel, multi-stage approach to secure AI training: (1) The model is designed on a simulated clinical knowledge graph (cKG), which represents only the structural characteristics of the real cKG without revealing any sensitive content. (2) The model is then integrated into the FeatureCloud (FC) federated learning framework, where it is prepared in a single-client configuration within a protected execution environment. (3) Training takes place within the hospital environment on the real cKG, either under the direct supervision of hospital staff or via a fully automated pipeline controlled by the hospital. (4) Finally, verified evaluation scripts are executed, which return only aggregated performance metrics. This enables immediate performance feedback without sensitive patient data or individual predictions leaving the clinic. A fundamental element of this approach is the cKG itself, which organizes multi-omics and patient data within real-world hospital environments. The approach was successfully validated during the TUM.ai Makeathon 2024 (TUMaiM24) challenge set by the Dr. von Hauner Children's Hospital (HCH-LMU): 50 students developed models for patient classification and diagnosis without access to real data. Deploying secure algorithms via federated frameworks such as FC could be a practical way of achieving privacy-preserving AI in healthcare.
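Stage (4) above, the aggregation-only evaluation, can be sketched as a verified script run inside the hospital environment whose only output is a dictionary of cohort-level metrics. The function name, metric choices, and example labels below are illustrative assumptions, not the paper's actual evaluation API.

```python
# Minimal sketch of an aggregation-only evaluation script: it sees the
# on-site labels and model predictions, but the only object allowed to
# leave the clinic is a small dict of aggregate metrics. No per-patient
# prediction, identifier, or row index is ever returned.

def evaluate_aggregated(y_true, y_pred):
    """Return cohort-level binary-classification metrics only."""
    assert len(y_true) == len(y_pred), "label/prediction length mismatch"
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = len(y_true)
    return {
        "n": n,
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }

# Example: labels and predictions stay on-site; only the metrics dict
# crosses the hospital boundary.
metrics = evaluate_aggregated([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
print(metrics)
```

In practice such a script would also need to guard against small-cohort leakage (e.g., suppressing metrics when `n` is below a threshold), which matters for the rare-disease setting the paper targets.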