HELM-BERT: A Transformer for Medium-sized Peptide Property Prediction

📅 2025-12-28
🤖 AI Summary
Existing molecular language models struggle to jointly represent the chemical diversity of therapeutic peptides—including non-canonical modifications—and their cyclic topologies: SMILES obscures ring features and yields excessively long sequences, while amino acid sequences cannot encode unnatural modifications. Method: We introduce HELM-BERT, the first peptide language model built on the Hierarchical Editing Language for Macromolecules (HELM) and the first application of HELM syntax to pre-trained language modeling. The model adapts the DeBERTa architecture to HELM's hierarchical syntax, jointly encoding monomer chemical composition and connectivity topology. Contribution/Results: Pre-trained with self-supervision on 39,079 linear and cyclic peptides and then fine-tuned, the model significantly outperforms state-of-the-art SMILES-based methods on cyclic peptide membrane permeability and peptide–protein interaction prediction, with superior data efficiency and generalization. This work bridges the representational gap between small-molecule and protein language models.

📝 Abstract
Therapeutic peptides have emerged as a pivotal modality in modern drug discovery, occupying a chemically and topologically rich space. While accurate prediction of their physicochemical properties is essential for accelerating peptide development, existing molecular language models rely on representations that fail to capture this complexity. Atom-level SMILES notation generates long token sequences and obscures cyclic topology, whereas amino-acid-level representations cannot encode the diverse chemical modifications central to modern peptide design. To bridge this representational gap, the Hierarchical Editing Language for Macromolecules (HELM) offers a unified framework enabling precise description of both monomer composition and connectivity, making it a promising foundation for peptide language modeling. Here, we propose HELM-BERT, the first encoder-based peptide language model trained on HELM notation. Based on DeBERTa, HELM-BERT is specifically designed to capture hierarchical dependencies within HELM sequences. The model is pre-trained on a curated corpus of 39,079 chemically diverse peptides spanning linear and cyclic structures. HELM-BERT significantly outperforms state-of-the-art SMILES-based language models in downstream tasks, including cyclic peptide membrane permeability prediction and peptide-protein interaction prediction. These results demonstrate that HELM's explicit monomer- and topology-aware representations offer substantial data-efficiency advantages for modeling therapeutic peptides, bridging a long-standing gap between small-molecule and protein language models.
Problem

Research questions and friction points this paper is trying to address.

Predicts peptide properties using hierarchical HELM notation
Overcomes limitations of SMILES and amino-acid representations
Improves modeling of cyclic structures and chemical modifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

HELM notation captures peptide monomer composition and connectivity
HELM-BERT model based on DeBERTa captures hierarchical dependencies
Pre-trained on diverse peptides for property prediction tasks
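The monomer- and topology-aware view that HELM provides can be illustrated with a toy tokenizer: a HELM string carries a polymer section listing monomers (bracketed when non-canonical) and a connection section describing bonds such as head-to-tail cyclization. The sketch below is an illustrative assumption, not the paper's actual tokenizer, and the HELM string is a hypothetical example:

```python
import re

def tokenize_helm(helm: str):
    """Split a HELM string into monomer tokens and connection tokens.

    HELM strings use '$'-separated sections: the first lists polymers
    with their '.'-separated monomers; the second lists bonds between
    monomer attachment points (e.g. cyclization).
    """
    sections = helm.split("$")
    polymer_section = sections[0]
    connection_section = sections[1] if len(sections) > 1 else ""

    tokens = []
    # Monomers appear inside braces; brackets mark multi-letter
    # (typically non-canonical) monomers such as [meG].
    for match in re.finditer(r"\{([^}]*)\}", polymer_section):
        for monomer in match.group(1).split("."):
            tokens.append(monomer.strip("[]"))
    # Each listed connection contributes an explicit topology token.
    if connection_section:
        for conn in connection_section.split("|"):
            tokens.append(f"<bond:{conn}>")
    return tokens

# Hypothetical head-to-tail cyclic tripeptide with one
# non-canonical (N-methylated) monomer:
helm = "PEPTIDE1{A.[meG].C}$PEPTIDE1,PEPTIDE1,3:R2-1:R1$$$"
print(tokenize_helm(helm))
# → ['A', 'meG', 'C', '<bond:PEPTIDE1,PEPTIDE1,3:R2-1:R1>']
```

Note how the cyclic bond becomes a single explicit token, whereas in SMILES the same ring closure would be spread across distant ring-bond digits in a much longer atom-level string.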
Seungeon Lee
Max Planck Institute for Software Systems
Natural Language Processing · Representation Learning · Responsible AI
Takuto Koyama
Graduate School of Medicine, Kyoto University, Kyoto, Japan
Itsuki Maeda
Graduate School of Medicine, Kyoto University, Kyoto, Japan
Shigeyuki Matsumoto
Graduate School of Medicine, Kyoto University, Kyoto, Japan
Yasushi Okuno
Graduate School of Medicine, Kyoto University, Kyoto, Japan