Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning

📅 2025-02-17
🤖 AI Summary
Determining the protein subunit stoichiometry of virus-like particles (VLPs) currently relies on high-purity samples and time-intensive experimental assays. To address this bottleneck, the study introduces an interpretable machine learning classification framework tailored to VLP assembly. The authors curate a dedicated VLP stoichiometry dataset and compare several sequence encodings, including k-mer and position-specific scoring matrix (PSSM) features. A linear model keeps the pipeline interpretable, and LIME and SHAP are used for feature attribution. In the evaluation, the pipeline classifies stoichiometry while highlighting sequence features that may influence VLP assembly. All code and data are publicly released. Key contributions include: (i) a curated, stoichiometry-labeled VLP dataset; (ii) an interpretable, lightweight ML pipeline designed for biological insight; and (iii) the combination of stoichiometry classification with the identification of sequence features potentially linked to assembly.
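The k-mer encoding mentioned in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of k = 2, the frequency normalisation, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_counts(sequence: str, k: int = 2) -> Counter:
    """Count overlapping k-mers in a protein sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence: str, k: int = 2) -> list[float]:
    """Fixed-length k-mer frequency vector over the 20**k vocabulary.

    k-mers containing non-standard residues are ignored (an assumption).
    """
    counts = kmer_counts(sequence, k)
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    total = sum(counts[m] for m in vocab)
    return [counts[m] / total if total else 0.0 for m in vocab]


toy_sequence = "MKTAYIAKQR"  # hypothetical sequence, not from the dataset
vec = kmer_vector(toy_sequence, k=2)
```

Each sequence maps to a vector of the same length regardless of how long the protein is, which is what lets a linear model assign one weight per k-mer.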

📝 Abstract
Virus-like particles (VLPs) are valuable for vaccine development due to their immune-triggering properties. Understanding their stoichiometry, the number of protein subunits that form a VLP, is critical for vaccine optimisation. However, current experimental methods to determine stoichiometry are time-consuming and require highly purified proteins. To classify proteins by stoichiometry class efficiently, we curate a new dataset and propose an interpretable, data-driven pipeline leveraging linear machine learning models. We also explore the impact of feature encoding on model performance and interpretability, as well as methods to identify key protein sequence features influencing classification. The evaluation of our pipeline demonstrates that it can classify stoichiometry while revealing protein features that possibly influence VLP assembly. The data and code used in this work are publicly available at https://github.com/Shef-AIRE/StoicIML.
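One reason linear models are considered interpretable is that the decision score splits into one additive term per feature. The sketch below shows that decomposition under stated assumptions: the feature names, weights, bias, and feature values are invented for illustration and do not come from the paper, and the helper `linear_contributions` is hypothetical.

```python
def linear_contributions(weights, features, names):
    """Return (name, weight * value) pairs, largest absolute contribution first."""
    contribs = [(n, w * x) for n, w, x in zip(names, weights, features)]
    return sorted(contribs, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical k-mer features and model parameters (illustrative only).
names = ["AK", "KT", "GG"]
weights = [1.5, -0.5, 2.0]
features = [0.2, 0.1, 0.0]
bias = 0.1

ranked = linear_contributions(weights, features, names)
# For a linear model the score is the bias plus the sum of the contributions.
score = bias + sum(c for _, c in ranked)
```

A feature with a large weight but a zero value (here `GG`) contributes nothing to this particular prediction, which is why per-sample contributions can differ from a ranking by weight alone.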
Problem

Research questions and friction points this paper is trying to address.

Classify stoichiometry of Virus-like Particles
Optimize vaccine development with ML
Identify key protein sequence features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interpretable linear machine learning models
Feature encoding impact analysis
Public dataset and code availability
Authors
Jiayang Zhang
AI Research Engineer, The University of Sheffield
Healthcare AI, AI for biomedicine, Multimodal AI
Xianyuan Liu
University of Sheffield
Deep Learning, Materials Design, Machine Learning
Wei Wu
School of Chemical, Materials and Biological Engineering, University of Sheffield, United Kingdom
Sina Tabakhi
Doctoral Researcher, School of Computer Science, University of Sheffield
Machine Learning, Graph Neural Networks, Feature Selection, Multimodal Learning, Multiomics
Wenrui Fan
AI Research Engineer, The University of Sheffield
Multi-modal AI, Self-supervised learning, Computer Vision
Shuo Zhou
Centre for Machine Intelligence and School of Computer Science, University of Sheffield, United Kingdom
K. L. Tee
School of Chemical, Materials and Biological Engineering, University of Sheffield, United Kingdom
T. S. Wong
School of Chemical, Materials and Biological Engineering, University of Sheffield, United Kingdom
Haiping Lu
Professor of Machine Learning, University of Sheffield
Machine learning, Multimodal AI, AI4Health, AI4Science, Open-source software