🤖 AI Summary
Problem: Many widely used protein similarity measures fail the axioms of a distance metric and change discontinuously under tiny perturbations of atoms. This hinders discrimination of subtle conformational differences and undermines the reliability of machine learning models built on such measures.
Method: We introduce a complete and bi-Lipschitz (i.e., both Lipschitz continuous and co-Lipschitz) invariant of protein backbones under rigid motion, whose values are compared by a distance satisfying the metric axioms. Our approach integrates geometric topology, metric analysis, and invariant theory of the rigid-motion group to construct a low-dimensional invariant space that uniquely and stably encodes 3D backbone geometry.
Contribution/Results: The proposed invariant is theoretically guaranteed to be complete and Lipschitz bi-continuous under rigid motion. Applied to the Protein Data Bank (PDB), it detects thousands of (near-)duplicate structures, whose presence skews machine learning predictions. The resulting invariant space supports low-dimensional maps with analytically defined coordinates.
📝 Abstract
Proteins are large biomolecules that regulate all living organisms and consist of one or several chains. The primary structure of a protein chain is a sequence of amino acid residues whose three main atoms (alpha-carbon, nitrogen, and carbonyl carbon) form a protein backbone. The tertiary structure is the rigid shape of a protein chain represented by atomic positions in 3-dimensional space. Because different geometric structures often have distinct functional properties, it is important to continuously quantify differences in rigid shapes of protein backbones. Unfortunately, many widely used similarities of proteins fail the axioms of a distance metric and change discontinuously under tiny perturbations of atoms. This paper develops a complete invariant that identifies any protein backbone in 3-dimensional space uniquely up to rigid motion. This invariant is Lipschitz bi-continuous in the sense that it changes by at most a constant multiple of the maximum perturbation of atoms, and vice versa. The new invariant has been used to detect thousands of (near-)duplicates in the Protein Data Bank, whose presence inevitably skews machine learning predictions. The resulting invariant space allows low-dimensional maps with analytically defined coordinates that reveal substantial variability in the protein universe.
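
The completeness and Lipschitz bi-continuity described in the abstract can be written out as follows. This is only a sketch of the general form of such conditions, not the paper's exact statement: the symbols $I$ (the invariant), $d$ (a metric on invariant values), $m$ (the number of residues), and the constants $\lambda, \mu$ are notation assumed here for illustration, and the precise constants and any extra hypotheses (for example on bond lengths) are given in the paper itself.

```latex
% Assumed notation: S = (p_1, ..., p_{3m}) and Q = (q_1, ..., q_{3m}) are
% backbones of m residues (three main atoms per residue), I is the invariant,
% d is a metric on invariant values, and \lambda, \mu > 0 are hypothetical constants.

% Completeness: equal invariants exactly characterize rigidly equivalent backbones.
I(S) = I(Q)
\iff
Q = f(S) \text{ for some rigid motion } f \text{ of } \mathbb{R}^3 .

% Lipschitz continuity: small atomic perturbations give small changes in the invariant.
\max_{1 \le i \le 3m} \lVert p_i - q_i \rVert \le \varepsilon
\;\Longrightarrow\;
d\bigl(I(S), I(Q)\bigr) \le \lambda \, \varepsilon .

% Converse ("and vice versa"): close invariants imply close backbones after
% a suitable rigid motion f.
d\bigl(I(S), I(Q)\bigr) = \delta
\;\Longrightarrow\;
\exists \text{ a rigid motion } f :\;
\max_{1 \le i \le 3m} \lVert f(p_i) - q_i \rVert \le \mu \, \delta .
```

Read together, the last two conditions say that a distance between invariants and the maximum atomic deviation after optimal rigid alignment bound each other up to constant factors, which is what makes a small threshold on $d$ a meaningful test for (near-)duplicates.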