🤖 AI Summary
Existing self-supervised visual models rely heavily on strong data augmentation, which limits their ability to model human-like continuous temporal perception. Method: PhiNet v2 is a brain-inspired self-supervised foundation model for video sequences that replaces hand-crafted strong augmentations with a variational inference mechanism, enabling it to learn temporally coherent representations. It combines a Transformer architecture with variational temporal modeling to extract robust dynamic representations directly from raw image streams. Contribution/Results: Unlike static-image paradigms, PhiNet v2 learns from sequential input without strong data augmentation while achieving performance competitive with state-of-the-art vision foundation models. Its learned representations are also more biologically plausible, processing visual information in a manner more closely aligned with human cognitive processes. This work points toward brain-inspired, temporally adaptive visual foundation models.
📝 Abstract
Recent advances in self-supervised learning (SSL) have revolutionized computer vision through innovative architectures and learning objectives, yet they have not fully leveraged insights from biological visual processing systems. Recently, a brain-inspired SSL model named PhiNet was proposed; it is based on a ResNet backbone and operates on static image inputs with strong augmentation. In this paper, we introduce PhiNet v2, a novel Transformer-based architecture that processes temporal visual input (that is, sequences of images) without relying on strong augmentation. Our model leverages variational inference to learn robust visual representations from continuous input streams, much as the human visual system does. Through extensive experimentation, we demonstrate that PhiNet v2 achieves competitive performance compared to state-of-the-art vision foundation models, while maintaining the ability to learn from sequential input without strong data augmentation. This work represents a significant step toward more biologically plausible computer vision systems that process visual information in a manner more closely aligned with human cognitive processes.
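The abstract does not spell out the training objective. As a rough illustration only, a variational temporal SSL loss of the kind described (a variational posterior over latents, a prediction term linking consecutive frames, and a KL regularizer) can be sketched as below; the linear "encoder", the latent transition predictor, and the exact loss form are all assumptions for illustration, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "encoder" mapping a flattened frame to a latent mean and
# log-variance. Shapes and weights are illustrative assumptions.
D_IN, D_LAT = 16, 4
W_mu = rng.normal(scale=0.1, size=(D_IN, D_LAT))
W_logvar = rng.normal(scale=0.1, size=(D_IN, D_LAT))
W_pred = rng.normal(scale=0.1, size=(D_LAT, D_LAT))  # latent transition predictor

def encode(frame):
    """Variational posterior parameters (mean, log-variance) for one frame."""
    return frame @ W_mu, frame @ W_logvar

def variational_temporal_loss(frames):
    """Per-frame KL to a standard-normal prior, plus a squared prediction
    error between the predicted and sampled latent of consecutive frames."""
    total = 0.0
    prev_z = None
    for frame in frames:
        mu, logvar = encode(frame)
        # Reparameterized sample z ~ N(mu, exp(logvar))
        z = mu + np.exp(0.5 * logvar) * rng.normal(size=D_LAT)
        # Closed-form KL( N(mu, sigma^2) || N(0, I) )
        total += 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
        if prev_z is not None:
            pred = prev_z @ W_pred              # predicted current latent
            total += np.sum((pred - z) ** 2)    # temporal prediction error
        prev_z = z
    return total / len(frames)

# A short synthetic "video": 8 frames of smoothly drifting noise.
video = np.cumsum(rng.normal(scale=0.1, size=(8, D_IN)), axis=0)
loss = variational_temporal_loss(video)
print(np.isfinite(loss), loss > 0)
```

Note that no augmentation appears anywhere: the only learning signal is temporal consistency across frames, which is the property the abstract emphasizes.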