BioBlobs: Differentiable Graph Partitioning for Protein Representation Learning

📅 2025-10-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing protein representation learning models rely on fixed-scale neighborhoods (e.g., k-hop), which cannot capture the varied sizes and topologies of functionally relevant substructures and therefore distort functional signals. BioBlobs addresses this with an end-to-end differentiable graph partitioning module that dynamically divides protein structures into flexibly sized, non-overlapping “blobs” and uses vector quantization to build a shared, interpretable, function-oriented discrete vocabulary of substructures. By combining graph neural networks with soft clustering, BioBlobs discovers functional units adaptively and encodes them as discrete tokens. Applied to mainstream encoders such as GVP-GNN, it improves performance on secondary structure prediction, contact map inference, and functional annotation tasks. BioBlobs also offers mechanistic interpretability: the learned discrete substructures can be linked directly to biochemical function, connecting structural representation with functional understanding in a single differentiable framework.

📝 Abstract

Protein function is driven by coherent substructures that vary in size and topology, yet current protein representation learning (PRL) models distort these signals by relying on rigid substructures such as k-hop and fixed-radius neighbourhoods. We introduce BioBlobs, a plug-and-play, fully differentiable module that represents proteins by dynamically partitioning structures into flexibly sized, non-overlapping substructures ("blobs"). The resulting blobs are quantized into a shared and interpretable codebook, yielding a discrete vocabulary of function-relevant protein substructures used to compute protein embeddings. We show that BioBlobs representations improve the performance of widely used protein encoders such as GVP-GNN across various PRL tasks. Our approach highlights the value of architectures that directly capture function-relevant protein substructures, enabling both improved predictive performance and mechanistic insight into protein function.
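
To make the idea of a flexibly sized, non-overlapping partition concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the per-residue score matrix, the softmax relaxation, and the `hard_partition` helper are all assumptions chosen for illustration. The abstract does not say how BioBlobs keeps the partition differentiable, so the sketch only shows the forward step of turning soft assignment scores into disjoint blobs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_partition(assignment_scores):
    """Turn per-node blob scores into a non-overlapping partition.

    assignment_scores[i][b] is a hypothetical score for placing node i
    in blob b (in a real model it would come from a learned network).
    Returns the soft assignments and a dict: blob id -> list of nodes.
    """
    soft = [softmax(row) for row in assignment_scores]
    labels = [row.index(max(row)) for row in soft]  # one blob per node
    blobs = {}
    for node, label in enumerate(labels):
        blobs.setdefault(label, []).append(node)
    return soft, blobs

# Toy example: 6 residues, 3 candidate blobs (values are made up).
scores = [
    [2.0, 0.1, 0.0],
    [1.5, 0.2, 0.1],
    [0.0, 3.0, 0.2],
    [0.1, 2.5, 0.0],
    [0.2, 2.0, 0.1],
    [0.0, 0.1, 1.8],
]
soft, blobs = hard_partition(scores)
print(blobs)
```

Each residue lands in exactly one blob, and blob sizes differ (here 2, 3, and 1) because they follow the scores rather than a fixed hop count or radius. Training through the argmax step would need a gradient-friendly estimator such as a straight-through or Gumbel-softmax relaxation; which one the paper uses is not stated here.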

Problem

Research questions and friction points this paper is trying to address.

Dynamic partitioning of protein structures into flexible substructures

Overcoming distortion from rigid substructures in protein representation

Creating interpretable vocabulary of function-relevant protein substructures
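
The rigidity of k-hop neighbourhoods can be seen on a toy graph. The sketch below is an illustration only: the path graph, the `k_hop_neighborhood` helper, and the choice of k are assumptions, not taken from the paper. It shows that k-hop neighbourhoods of adjacent nodes overlap and that their extent is set by k rather than by the structure itself.

```python
from collections import deque

def k_hop_neighborhood(adjacency, source, k):
    """Return the set of nodes within k edges of source (breadth-first search)."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return set(depth)

# Toy residue graph: a simple path 0-1-2-3-4-5 (hypothetical).
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
hoods = {n: k_hop_neighborhood(adjacency, n, 1) for n in adjacency}

print(hoods[2])             # neighbourhood centred on node 2
print(hoods[2] & hoods[3])  # nodes shared with the neighbourhood of node 3
```

Every node gets its own neighbourhood, so the same residues are counted many times and no neighbourhood can grow or shrink to fit a functional unit. A partition into blobs instead assigns each residue once.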

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic graph partitioning for flexible protein substructures

Quantizing substructures into interpretable codebook vocabulary

Differentiable module improving protein encoder performance
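
The codebook step can be sketched as a nearest-neighbour lookup, which is the standard form of vector quantization. This is a minimal illustration under stated assumptions: mean pooling as the blob embedding, Euclidean distance, and the toy features and codebook are all hypothetical, and the paper may pool and quantize differently.

```python
import math

def mean_pool(node_features, members):
    """Average the feature vectors of the nodes that belong to one blob."""
    dim = len(node_features[0])
    return [
        sum(node_features[i][d] for i in members) / len(members)
        for d in range(dim)
    ]

def nearest_code(vector, codebook):
    """Index of the codebook entry closest to vector in Euclidean distance."""
    distances = [math.dist(vector, code) for code in codebook]
    return distances.index(min(distances))

def quantize_blobs(node_features, blobs, codebook):
    """Map each blob (a list of node indices) to one discrete codebook index."""
    return [
        nearest_code(mean_pool(node_features, blob), codebook)
        for blob in blobs
    ]

# Toy data: 5 residues with 2-D features, already split into two blobs.
node_features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1], [1.0, 0.9], [0.8, 1.1]]
blobs = [[0, 1, 2], [3, 4]]
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # hypothetical shared vocabulary

tokens = quantize_blobs(node_features, blobs, codebook)
print(tokens)
```

Because the codebook is shared across proteins, two blobs from different structures that map to the same index can be read as instances of the same substructure "word", which is what makes the vocabulary inspectable.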

🔎 Similar Papers

GOProteinGNN: Leveraging Protein Knowledge Graphs for Protein Representation Learning