🤖 AI Summary
It remains unclear whether protein Transformers exhibit biologically interpretable intelligence. Method: We introduce Protein-FN, a protein function dataset with over 9000 labeled proteins; propose Sequence Protein Transformer (SPT), a lightweight architecture (e.g., SPT-Tiny with only 5.4M parameters); and design Sequence Score, a novel interpretability method that decodes the biologically relevant sequence patterns captured by the model. Contribution/Results: SPT achieves state-of-the-art accuracy, outperforming comparable models: even SPT-Tiny, trained from scratch, reaches 94.3% on the Antibiotic Resistance (AR) dataset and 99.6% on Protein-FN. Sequence Score identifies critical residues that align closely with known functional sites and evolutionarily conserved motifs, empirically supporting the biological plausibility of model decisions. This work establishes a new paradigm for interpretable protein language modeling and data-driven discovery of molecular mechanisms.
📝 Abstract
Deep neural networks, particularly Transformers, have been widely adopted for predicting the functional properties of proteins. In this work, we explore whether Protein Transformers can capture biological intelligence from protein sequences. To this end, we first introduce a protein function dataset, Protein-FN, which provides over 9000 proteins with meaningful labels. Second, we devise a new Transformer architecture, Sequence Protein Transformers (SPT), for computationally efficient protein function prediction. Third, we develop a novel Explainable Artificial Intelligence (XAI) technique called Sequence Score, which efficiently interprets the decision-making processes of protein models and thereby overcomes the difficulty of deciphering the biological intelligence hidden in Protein Transformers. Remarkably, even our smallest model, SPT-Tiny, which contains only 5.4M parameters, demonstrates impressive predictive accuracy when trained from scratch, achieving 94.3% on the Antibiotic Resistance (AR) dataset and 99.6% on the Protein-FN dataset. In addition, our Sequence Score technique reveals that our SPT models can discover several meaningful patterns underlying the sequence structures of protein data, and that these patterns align closely with domain knowledge in the biology community. We have officially released our Protein-FN dataset on Hugging Face Datasets at https://huggingface.co/datasets/Protein-FN/Protein-FN. Our code is available at https://github.com/fudong03/BioIntelligence.
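For readers who want to inspect the released dataset, the following is a minimal sketch of how it might be loaded with the Hugging Face `datasets` library. The repository ID is derived from the URL given above; the helper names `repo_id_from_url` and `load_protein_fn` are hypothetical and are not part of the authors' code. No split or column names are assumed here, since the abstract does not state them.

```python
# Dataset URL exactly as given in the abstract.
DATASET_URL = "https://huggingface.co/datasets/Protein-FN/Protein-FN"

# Standard URL prefix for dataset repositories on the Hugging Face Hub.
HUB_DATASET_PREFIX = "https://huggingface.co/datasets/"


def repo_id_from_url(url: str) -> str:
    """Extract the '<namespace>/<name>' repository ID from a Hub dataset URL.

    Hypothetical helper for illustration only.
    """
    if not url.startswith(HUB_DATASET_PREFIX):
        raise ValueError(f"Not a Hugging Face dataset URL: {url}")
    return url[len(HUB_DATASET_PREFIX):].strip("/")


def load_protein_fn():
    """Download Protein-FN from the Hub (requires network access).

    Hypothetical helper; it assumes the repository loads with the default
    configuration, which the abstract does not confirm.
    """
    # Imported lazily so the rest of this module works without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id_from_url(DATASET_URL))
```

Calling `load_protein_fn()` and printing the result should list whichever splits and columns the authors published, which is the safest way to discover the actual schema before writing any training code.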