🤖 AI Summary
Determining the protein subunit stoichiometry of virus-like particles (VLPs) currently depends on high-purity samples and time-intensive experimental assays. To address this bottleneck, the study introduces the first interpretable machine learning classification framework tailored to VLP assembly. The authors construct the first dedicated VLP stoichiometry dataset and combine several sequence encodings, including k-mer and position-specific scoring matrix (PSSM) features. A linear model keeps the pipeline interpretable, and LIME and SHAP provide additional feature attribution. The method achieves state-of-the-art accuracy and systematically identifies conserved, assembly-related sequence motifs across diverse VLP families. All code and data are publicly released. Key contributions are: (i) the first curated, stoichiometry-labeled VLP dataset; (ii) an interpretable, lightweight ML pipeline explicitly designed for biological insight; and (iii) the unification of high-accuracy classification with mechanistic discovery of assembly-determining sequence features.
📝 Abstract
Virus-like particles (VLPs) are valuable for vaccine development due to their immune-triggering properties. Understanding their stoichiometry, the number of protein subunits that form a VLP, is critical for vaccine optimisation. However, current experimental methods for determining stoichiometry are time-consuming and require highly purified proteins. To classify proteins by stoichiometry class efficiently, we curate a new dataset and propose an interpretable, data-driven pipeline built on linear machine learning models. We also explore how feature encoding affects model performance and interpretability, and we examine methods for identifying the protein sequence features that most influence classification. Our evaluation shows that the pipeline can classify stoichiometry while revealing protein features that may influence VLP assembly. The data and code used in this work are publicly available at https://github.com/Shef-AIRE/StoicIML.
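To make the idea of an interpretable linear model over sequence encodings concrete, the following is a minimal sketch and not the authors' implementation (their code is in the repository linked above). It assumes a k-mer frequency encoding with k = 2, a hand-written logistic regression, and four made-up toy sequences with made-up binary labels; the function names, the motifs "GW" and "KL", and the hyperparameters are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_vocabulary(k):
    """All possible k-mers over the standard amino-acid alphabet."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]


def kmer_features(seq, k, vocab_index):
    """Encode a sequence as normalised k-mer frequencies."""
    n_windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vec = [0.0] * len(vocab_index)
    for kmer, count in counts.items():
        j = vocab_index.get(kmer)
        if j is not None:  # skip k-mers with non-standard residues
            vec[j] = count / n_windows
    return vec


def score(w, b, x):
    """Linear decision score: positive means class 1."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_logistic(X, y, lr=0.5, epochs=500):
    """Plain stochastic gradient descent on the logistic loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-score(w, b, xi)))
            g = p - yi  # gradient of the loss w.r.t. the score
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b


# Toy data (hypothetical): class 1 is rich in "GW", class 0 in "KL".
K = 2
vocab = kmer_vocabulary(K)
vocab_index = {kmer: j for j, kmer in enumerate(vocab)}
sequences = ["MGWAGWSGW", "AGWTGWGWA", "MKLAKLSKL", "AKLTKLKLA"]
labels = [1, 1, 0, 0]

X = [kmer_features(s, K, vocab_index) for s in sequences]
w, b = train_logistic(X, labels)

# Interpretability: each weight belongs to exactly one k-mer, so sorting
# the weights shows which k-mers push a sequence towards each class.
ranked = sorted(zip(vocab, w), key=lambda t: t[1], reverse=True)
print("most class-1-like k-mers:", ranked[:3])
print("most class-0-like k-mers:", ranked[-3:])
```

Because every feature corresponds to one k-mer, the learned weights can be read directly as evidence for or against a class. Richer encodings such as PSSM features, or attribution tools such as LIME and SHAP, would replace the encoding and the ranking step respectively.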