Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning

📅 2025-02-17

🤖 AI Summary

Determining the protein subunit stoichiometry of virus-like particles (VLPs) currently relies on high-purity samples and time-intensive experimental assays. To address this bottleneck, the study introduces an interpretable machine learning classification framework tailored to VLP assembly. The authors curate a new VLP stoichiometry dataset and compare several sequence encodings, including k-mer and position-specific scoring matrix (PSSM) features. A linear model keeps the classifier interpretable, and LIME and SHAP are used for feature attribution. The evaluation shows that the pipeline can classify stoichiometry while highlighting sequence features that possibly influence VLP assembly. All code and data are publicly released. Key contributions include: (i) a curated, stoichiometry-labeled VLP dataset; (ii) an interpretable, lightweight ML pipeline designed for biological insight; and (iii) a combination of classification with the identification of candidate assembly-related sequence features.

📝 Abstract

Virus-like particles (VLPs) are valuable for vaccine development due to their immune-triggering properties. Understanding their stoichiometry, the number of protein subunits that form a VLP, is critical for vaccine optimisation. However, current experimental methods for determining stoichiometry are time-consuming and require highly purified proteins. To classify protein stoichiometry efficiently, we curate a new dataset and propose an interpretable, data-driven pipeline based on linear machine learning models. We also explore how feature encoding affects model performance and interpretability, as well as methods for identifying the key protein sequence features that influence classification. The evaluation of our pipeline demonstrates that it can classify stoichiometry while revealing protein features that possibly influence VLP assembly. The data and code used in this work are publicly available at https://github.com/Shef-AIRE/StoicIML.
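
As a rough illustration of how a k-mer sequence encoding can feed an interpretable linear model, the sketch below turns a protein sequence into normalized k-mer frequencies and ranks features by the absolute size of their linear weights. It is a minimal sketch, not the authors' implementation: the function names (kmer_vocabulary, kmer_encode, top_features), the 20-letter amino acid alphabet, the choice k=2, and the example weights are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids; other symbols (e.g. "X") are ignored
# as features but still count towards the normalization total.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vocabulary(k):
    """Return all possible k-mers over the amino acid alphabet, in a fixed order."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]


def kmer_encode(sequence, k=2):
    """Encode a sequence as a fixed-length vector of normalized k-mer frequencies."""
    n_windows = len(sequence) - k + 1
    counts = Counter(sequence[i:i + k] for i in range(max(n_windows, 0)))
    total = max(n_windows, 1)  # avoid division by zero for very short sequences
    return [counts.get(kmer, 0) / total for kmer in kmer_vocabulary(k)]


def top_features(weights, vocabulary, n=3):
    """Rank features of a linear model by absolute weight (hypothetical helper)."""
    ranked = sorted(zip(vocabulary, weights), key=lambda kw: abs(kw[1]), reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    vector = kmer_encode("ACAC", k=2)
    vocab = kmer_vocabulary(2)
    # Print only the k-mers that actually occur in the toy sequence.
    print([(kmer, round(v, 3)) for kmer, v in zip(vocab, vector) if v > 0])
    # Made-up weights, standing in for the coefficients of a fitted linear model.
    print(top_features([0.1, -0.9, 0.3], ["AA", "AC", "AD"], n=2))
```

Because each position in the vector corresponds to one named k-mer, the coefficients of a linear classifier fitted on such vectors can be read directly as per-k-mer contributions, which is the kind of interpretability the abstract refers to.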

Problem

Research questions and friction points this paper is trying to address.

Classify stoichiometry of Virus-like Particles

Optimize vaccine development with ML

Identify key protein sequence features

Innovation

Methods, ideas, or system contributions that make the work stand out.

Interpretable linear machine learning models

Feature encoding impact analysis

Public dataset and code availability
