🤖 AI Summary
Existing self-supervised visual models rely heavily on strong data augmentation, which limits their ability to model human-like continuous temporal perception. Method: PhiNet v2 is a brain-inspired self-supervised foundation model for video sequences that replaces hand-crafted strong augmentations with a variational inference mechanism, enabling it to learn temporally coherent representations. It combines a Transformer architecture with variational temporal modeling to extract robust dynamic representations directly from raw image streams. Contribution/Results: Unlike static-image paradigms, PhiNet v2 learns from sequential input without strong data augmentation while achieving performance competitive with state-of-the-art vision foundation models. Its learned representations are also more biologically plausible, processing visual information in a manner more closely aligned with human cognitive processes. This work points toward brain-inspired, temporally adaptive visual foundation models.
📝 Abstract
Recent advances in self-supervised learning (SSL) have revolutionized computer vision through innovative architectures and learning objectives, yet they have not fully leveraged insights from biological visual processing systems. Recently, a brain-inspired SSL model named PhiNet was proposed; it is based on a ResNet backbone and operates on static image inputs with strong augmentation. In this paper, we introduce PhiNet v2, a novel Transformer-based architecture that processes temporal visual input (that is, sequences of images) without relying on strong augmentation. Our model leverages variational inference to learn robust visual representations from continuous input streams, much as the human visual system does. Through extensive experimentation, we demonstrate that PhiNet v2 achieves competitive performance compared to state-of-the-art vision foundation models, while maintaining the ability to learn from sequential input without strong data augmentation. This work represents a significant step toward more biologically plausible computer vision systems that process visual information in a manner more closely aligned with human cognitive processes.
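The abstract does not spell out the training objective. As a rough illustration only, a variational temporal SSL loss of the kind described (a variational posterior over latents, a prediction term linking consecutive frames, and a KL regularizer) can be sketched as below; the linear "encoder", the latent transition predictor, and the exact loss form are all assumptions for illustration, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "encoder" mapping a flattened frame to a latent mean and
# log-variance. Shapes and weights are illustrative assumptions.
D_IN, D_LAT = 16, 4
W_mu = rng.normal(scale=0.1, size=(D_IN, D_LAT))
W_logvar = rng.normal(scale=0.1, size=(D_IN, D_LAT))
W_pred = rng.normal(scale=0.1, size=(D_LAT, D_LAT))  # latent transition predictor

def encode(frame):
    """Variational posterior parameters (mean, log-variance) for one frame."""
    return frame @ W_mu, frame @ W_logvar

def variational_temporal_loss(frames):
    """Per-frame KL to a standard-normal prior, plus a squared prediction
    error between the predicted and sampled latent of consecutive frames."""
    total = 0.0
    prev_z = None
    for frame in frames:
        mu, logvar = encode(frame)
        # Reparameterized sample z ~ N(mu, exp(logvar))
        z = mu + np.exp(0.5 * logvar) * rng.normal(size=D_LAT)
        # Closed-form KL( N(mu, sigma^2) || N(0, I) )
        total += 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
        if prev_z is not None:
            pred = prev_z @ W_pred              # predicted current latent
            total += np.sum((pred - z) ** 2)    # temporal prediction error
        prev_z = z
    return total / len(frames)

# A short synthetic "video": 8 frames of smoothly drifting noise.
video = np.cumsum(rng.normal(scale=0.1, size=(8, D_IN)), axis=0)
loss = variational_temporal_loss(video)
print(np.isfinite(loss), loss > 0)
```

Note that no augmentation appears anywhere: the only learning signal is temporal consistency across frames, which is the property the abstract emphasizes.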