AI Summary
Accurate temporal forecasting of viral variant distributions across subpopulations (e.g., countries) is critical for precision public health interventions and therapeutic design, yet existing machine learning approaches lack geographic specificity and neglect epidemiological transmission dynamics. We propose the first method modeling viral evolution as a system of linear ordinary differential equations (ODEs), explicitly incorporating geographically resolved transmission rates to jointly capture inter-subpopulation spread and mutation-dependent evolutionary dependencies. Our framework learns transmission rates in a data-driven manner and integrates joint probabilistic modeling across multiple subpopulations, yielding interpretable, location-specific temporal predictions. Evaluated on multi-year SARS-CoV-2 and influenza A/H3N2 genomic surveillance data, our approach significantly outperforms established baselines. Crucially, the learned transmission rates exhibit strong concordance with phylogenetic analyses, substantiating the model's biological plausibility and mechanistic validity.
Abstract
Forecasting the change in the distribution of viral variants is crucial for therapeutic design and disease surveillance. This task poses significant modeling challenges due to the sharp differences in virus distributions across sub-populations (e.g., countries) and their dynamic interactions. Existing machine learning approaches that model the variant distribution as a whole are incapable of making location-specific predictions and ignore transmissions that shape the viral landscape. In this paper, we propose a sub-population specific protein evolution model, which predicts the time-resolved distributions of viral proteins in different locations. The algorithm explicitly models the transmission rates between sub-populations and learns their interdependence from data. The change in protein distributions across all sub-populations is defined through a linear ordinary differential equation (ODE) parametrized by transmission rates. Solving this ODE yields the likelihood of a given protein occurring in particular sub-populations. Multi-year evaluation on both SARS-CoV-2 and influenza A/H3N2 demonstrates that our model outperforms baselines in accurately predicting distributions of viral proteins across continents and countries. We also find that the transmission rates learned from data are consistent with the transmission pathways discovered by retrospective phylogenetic analysis.
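To make the idea of a linear ODE parametrized by transmission rates concrete, the sketch below solves a toy system dp/dt = A p in closed form, p(t) = exp(A t) p(0), for a handful of sub-populations. The abstract does not state the exact form of the ODE, so everything here is an illustrative assumption rather than the authors' model: the function names (`build_generator`, `mat_exp`, `propagate`), the convention that `rates[i][j]` is the transmission rate from sub-population `j` into `i`, and the choice of a mass-conserving generator whose columns sum to zero. In the paper the rates are learned from data; here they are fixed by hand.

```python
# Toy linear ODE over sub-populations: dp/dt = A p, solved as p(t) = exp(A t) p(0).
# ASSUMPTION: rates[i][j] is a hypothetical transmission rate from location j to i,
# and A is built so that total probability mass is conserved. This is a sketch,
# not the model from the paper.

def mat_mul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def mat_exp(a, squarings=10, terms=20):
    """Matrix exponential via a truncated Taylor series with scaling and squaring."""
    n = len(a)
    scale = 2.0 ** squarings
    scaled = [[x / scale for x in row] for row in a]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        # Undo the scaling: exp(A) = (exp(A / 2^s))^(2^s)
        result = mat_mul(result, result)
    return result

def build_generator(rates):
    """Build A from off-diagonal rates; each column sums to zero."""
    n = len(rates)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                a[i][j] = rates[i][j]    # inflow into i from j
                a[j][j] -= rates[i][j]   # matching outflow from j
    return a

def propagate(rates, p0, t):
    """Return p(t) = exp(A t) p(0) for the generator built from `rates`."""
    a = build_generator(rates)
    e = mat_exp([[x * t for x in row] for row in a])
    return [sum(e[i][j] * p0[j] for j in range(len(p0))) for i in range(len(p0))]

if __name__ == "__main__":
    # Three hypothetical locations; the variant starts entirely in location 0.
    rates = [[0.0, 0.2, 0.1],
             [0.5, 0.0, 0.3],
             [0.1, 0.4, 0.0]]
    for t in (0.0, 1.0, 5.0):
        print(t, [round(x, 4) for x in propagate(rates, [1.0, 0.0, 0.0], t)])
```

Because the generator's columns sum to zero, the propagated vector stays a probability distribution over locations at every time `t`, which is one simple way a solved linear ODE can be read as the likelihood of a protein occurring in each sub-population. In practice a library routine such as `scipy.linalg.expm` would replace the hand-written `mat_exp`.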