🤖 AI Summary
Existing self-supervised temporal representation learning heavily relies on hand-crafted data augmentations, which require domain expertise and often introduce bias, compromising generalization.
Method: We propose the first augmentation-free unsupervised framework: instead of applying explicit augmentations, it generates geometrically distinct multi-view representations intrinsically, via orthonormal bases and overcomplete frames. Views are constructed through frame-wise projections, and the geometric discrepancy between the resulting manifolds serves as the self-supervised signal. The method operates within a contrastive learning paradigm, jointly optimizing the orthonormal transformations and overcomplete representations.
Contribution/Results: Our approach achieves state-of-the-art performance across five temporal learning tasks and nine benchmark datasets, outperforming prior methods by up to 15–20% in average accuracy, and particularly excels in complex signal domains where effective augmentation design is infeasible.
📝 Abstract
Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data. Most SSL approaches rely on strong, well-established, handcrafted data augmentations to generate diverse views for representation learning. However, designing such augmentations requires domain-specific knowledge and implicitly imposes representational invariances on the model, which can limit generalization. In this work, we propose an unsupervised representation learning method that replaces augmentations by generating views using orthonormal bases and overcomplete frames. We show that embeddings learned from orthonormal and overcomplete spaces reside on distinct manifolds, shaped by the geometric biases introduced by representing samples in different spaces. By jointly leveraging the complementary geometry of these distinct manifolds, our approach achieves superior performance without artificially increasing data diversity through strong augmentations. We demonstrate the effectiveness of our method on nine datasets across five temporal sequence tasks, where signal-specific characteristics make data augmentations particularly challenging. Without relying on augmentation-induced diversity, our method achieves performance gains of up to 15--20% over existing self-supervised approaches. Source code: https://github.com/eth-siplab/Learning-with-FrameProjections
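The view construction described above can be sketched minimally: project each signal frame onto an orthonormal basis and onto an overcomplete frame, yielding two geometrically distinct coefficient views that a contrastive loss would then pull together. This is a hedged illustration, not the paper's implementation; the DCT basis and the random unit-norm frame below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dct_basis(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)
    B = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    B[0] *= 1.0 / np.sqrt(n)       # DC row scaling
    B[1:] *= np.sqrt(2.0 / n)      # remaining rows
    return B                       # satisfies B @ B.T == I

def overcomplete_frame(n, m):
    """Random overcomplete frame: m > n unit-norm vectors in R^n (illustrative)."""
    F = rng.standard_normal((m, n))
    return F / np.linalg.norm(F, axis=1, keepdims=True)

x = rng.standard_normal(64)        # one frame of a 1-D temporal signal
B = dct_basis(64)                  # orthonormal basis, 64 x 64
F = overcomplete_frame(64, 128)    # overcomplete frame, 128 x 64 (redundant)

view_a = B @ x                     # coefficients in the orthonormal space
view_b = F @ x                     # redundant coefficients in the frame space
# view_a and view_b would be encoded separately; a contrastive objective
# (e.g., InfoNCE) then exploits the geometric discrepancy between the
# two embedding manifolds as the self-supervised signal.
```

Because the basis is orthonormal, `view_a` preserves the signal's energy (Parseval), while the overcomplete frame yields a redundant, geometrically biased representation, so no handcrafted augmentation is needed to obtain distinct views.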