🤖 AI Summary
Existing protein representation learning models rely on fixed-scale neighborhoods (e.g., k-hop), which fail to capture the diverse sizes and topologies of functionally relevant substructures and thereby distort functional signals. To address this, we propose BioBlobs: an end-to-end differentiable graph segmentation module that dynamically partitions protein structures into flexibly sized, non-overlapping “blobs” and constructs a shared, interpretable, function-oriented discrete substructure vocabulary via vector quantization. By integrating graph neural networks with soft clustering, BioBlobs enables the first adaptive discovery and discrete encoding of functional units. When applied to mainstream encoders such as GVP-GNN, it significantly improves performance across secondary structure prediction, contact map inference, and functional annotation tasks. Moreover, BioBlobs provides mechanistic interpretability by linking learned discrete substructures directly to biochemical function, thereby bridging structural representation and functional understanding in a principled, differentiable framework.
📝 Abstract
Protein function is driven by coherent substructures that vary in size and topology, yet current protein representation learning (PRL) models distort these signals by relying on rigid substructures such as k-hop and fixed-radius neighbourhoods. We introduce BioBlobs, a plug-and-play, fully differentiable module that represents proteins by dynamically partitioning structures into flexibly sized, non-overlapping substructures ("blobs"). The resulting blobs are quantized into a shared and interpretable codebook, yielding a discrete vocabulary of function-relevant protein substructures that is used to compute protein embeddings. We show that BioBlobs representations improve the performance of widely used protein encoders such as GVP-GNN across various PRL tasks. Our approach highlights the value of architectures that directly capture function-relevant protein substructures, enabling both improved predictive performance and mechanistic insight into protein function.
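To make the pipeline in the abstract concrete, the following is a minimal, illustrative sketch of its forward pass: pool residue features within each blob, map each blob to its nearest codebook entry, and average the selected code vectors into a protein embedding. It is not the authors' implementation. The blob assignments are supplied as input rather than learned, the differentiable partitioning and codebook training are omitted, and every name (`pool_blobs`, `quantize`, `embed_protein`), the mean pooling, and the toy values are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def pool_blobs(node_feats, blob_ids):
    """Mean-pool node features within each blob (hard, non-overlapping assignment)."""
    sums, counts = {}, {}
    for feat, b in zip(node_feats, blob_ids):
        if b not in sums:
            sums[b] = [0.0] * len(feat)
            counts[b] = 0
        sums[b] = [s + x for s, x in zip(sums[b], feat)]
        counts[b] += 1
    return {b: [s / counts[b] for s in sums[b]] for b in sums}


def quantize(vec, codebook):
    """Return the index of the codebook entry nearest to vec."""
    return min(range(len(codebook)), key=lambda k: dist(vec, codebook[k]))


def embed_protein(node_feats, blob_ids, codebook):
    """Return (code index per blob, mean of the selected code vectors)."""
    blobs = pool_blobs(node_feats, blob_ids)
    codes = [quantize(blobs[b], codebook) for b in sorted(blobs)]
    dim = len(codebook[0])
    emb = [sum(codebook[c][d] for c in codes) / len(codes) for d in range(dim)]
    return codes, emb


# Toy input (hypothetical): five residues with 2-D features, split into two blobs.
node_feats = [[0.0, 0.1], [0.2, -0.1], [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
blob_ids = [0, 0, 1, 1, 1]  # assumed given; BioBlobs learns the partition instead
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # assumed 3-entry shared vocabulary

codes, emb = embed_protein(node_feats, blob_ids, codebook)
print(codes, emb)
```

Because each blob is reduced to a code index, proteins can be compared by which codebook entries they use, which is the sense in which a discrete vocabulary is interpretable. A trainable version would additionally need a gradient path through the nearest-neighbour lookup, for example a straight-through estimator, which this sketch does not attempt.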