🤖 AI Summary
Constructing digital twins from biological time series remains highly manual and is hampered by noise, high dimensionality, and latent variables.
Method: We review methods for automatically inferring digital twins from biological time series, chiefly sparse regression (particularly under the Bayesian paradigm) and symbolic regression, alongside emerging deep learning and large language model approaches, and assess them against eight biological and methodological challenges. We argue for hybrid modular frameworks that combine chemical reaction network priors, Bayesian uncertainty quantification, and the knowledge integration capabilities of deep learning.
Contributions/Results: (1) Across the eight criteria, sparse regression generally outperforms symbolic regression, particularly within Bayesian frameworks; (2) we characterize both the promise and the reliability bottlenecks of deep learning and large language models for prior knowledge integration; (3) we propose a benchmarking framework for evaluating methods across all challenges. Together, these points outline a path toward interpretable, uncertainty-aware modeling of complex biological dynamics that pairs mechanistic grounding with data-driven scalability.
📝 Abstract
Recent technological advances have expanded the availability of high-throughput biological datasets, enabling the reliable design of digital twins of biomedical systems or patients. Such computational tools represent key reaction networks driving perturbation or drug response and can guide drug discovery and personalized therapeutics. Yet their development still relies on laborious data integration by the human modeler, making automated approaches critically needed. The success of data-driven system discovery in physics, rooted in clean datasets and well-defined governing laws, has fueled interest in applying similar techniques in biology, which presents unique challenges. Here, we review methodologies for automatically inferring digital twins from biological time series, most of which involve symbolic or sparse regression. We evaluate algorithms against eight biological and methodological challenges: noisy or incomplete data, multiple conditions, prior knowledge integration, latent variables, high dimensionality, unobserved variable derivatives, candidate library design, and uncertainty quantification. On these criteria, sparse regression generally outperforms symbolic regression, particularly when using Bayesian frameworks. We further highlight the emerging role of deep learning and large language models, which enable innovative prior knowledge integration, though the reliability and consistency of such approaches must be improved. While no single method addresses all challenges, we argue that progress in learning digital twins will come from hybrid and modular frameworks combining chemical reaction network-based mechanistic grounding, Bayesian uncertainty quantification, and the generative and knowledge integration capacities of deep learning. To support their development, we further propose a benchmarking framework to evaluate methods across all challenges.
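To make the sparse-regression idea concrete, the following is a minimal sketch and not a method taken from the reviewed work. It applies plain (non-Bayesian) sequentially thresholded least squares, as popularized by SINDy, to a hypothetical two-species network A -> B -> degraded. The rate constants, initial conditions, polynomial candidate library, threshold, and all function names are illustrative assumptions. It touches three of the challenges listed above: derivatives are not observed and are estimated by finite differences, the candidate library is chosen by hand, and several initial conditions are pooled, because on a single trajectory of this toy system the quadratic library is linearly dependent.

```python
import numpy as np

K1, K2 = 1.0, 0.5  # assumed rate constants of the toy network A -> B -> degraded


def simulate(a0, b0, t):
    """Analytic solution of dA/dt = -K1*A, dB/dt = K1*A - K2*B."""
    a = a0 * np.exp(-K1 * t)
    b = b0 * np.exp(-K2 * t) + a0 * K1 / (K2 - K1) * (np.exp(-K1 * t) - np.exp(-K2 * t))
    return np.column_stack([a, b])


def library(x):
    """Hand-picked candidate terms: 1, A, B, A^2, A*B, B^2."""
    a, b = x[:, 0], x[:, 1]
    return np.column_stack([np.ones_like(a), a, b, a**2, a * b, b**2])


def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for j in range(dxdt.shape[1]):
            keep = ~small[:, j]
            if keep.any():
                xi[keep, j] = np.linalg.lstsq(theta[:, keep], dxdt[:, j], rcond=None)[0]
    return xi


t = np.linspace(0.0, 5.0, 1001)
conditions = [(1.0, 0.0), (2.0, 0.5), (0.5, 1.5)]  # assumed initial (A0, B0) pairs

thetas, derivs = [], []
for a0, b0 in conditions:
    x = simulate(a0, b0, t)
    # Derivatives are not measured, so estimate them from the time series.
    derivs.append(np.gradient(x, t, axis=0, edge_order=2))
    thetas.append(library(x))

# Pool all conditions into one regression problem.
xi = stlsq(np.vstack(thetas), np.vstack(derivs))

names = ["1", "A", "B", "A^2", "A*B", "B^2"]
for j, lhs in enumerate(["dA/dt", "dB/dt"]):
    terms = [f"{xi[i, j]:+.3f}*{names[i]}" for i in range(len(names)) if xi[i, j] != 0.0]
    print(lhs, "=", " ".join(terms))
```

This sketch uses noise-free data and returns point estimates only. A Bayesian variant would instead place sparsity-inducing priors on the coefficients and report posterior uncertainty, which this example does not attempt.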