VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the tendency of self-supervised learning methods to rely on background textures and co-occurrence statistics while neglecting foreground objects. To overcome this limitation, the authors propose VINO, a framework that exploits structural priors in dense real-world videos to generate foreground-aligned joint views and object-conditioned scene views; the prior is used solely for view construction rather than as semantic pseudo-labels, which disrupts contextual shortcuts. Within a teacher–student architecture, VINO enforces object-centric invariant representations through asymmetric distillation, mask-guided local augmentation, cross-temporal object-persistence constraints, and a structural information bottleneck. Evaluated on unsupervised object discovery on PASCAL VOC, VINO achieves a CorLoc of 34.8, substantially outperforming existing dense-video and motion-guided approaches, and produces highly focused, shape-biased representations.

📝 Abstract
Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts: background textures and co-occurrence statistics. While video provides rich temporal variation, dense in-the-wild streams with strong ego-motion create a co-occurrence trap: foreground objects and background context move coherently, encouraging representations to collapse into scene encoders. To address this, we propose VINO (Video-driven Invariance for Non-Contextual Objects), a teacher–student framework that learns robust image encoders from dense video by imposing a structural information bottleneck. Using a class-agnostic structural prior solely to generate views, not as semantic pseudo-labels, VINO forms an asymmetric distillation problem. The teacher predicts from a foreground-union view with the background suppressed, while the student observes object-conditioned scene views that retain surrounding context but remove competing instances. Matching these targets via masked distillation makes background cues unreliable, pushing the representation toward object-centric invariances. We further enforce temporal object permanence via teacher-anchored cross-time distillation over track-matched objects, and stabilize part-to-whole consistency with mask-guided local views. Through attention visualization and unsupervised object discovery on PASCAL VOC, we demonstrate that VINO effectively disentangles foreground from background. Pretrained on the dense Walking Tours Venice video, VINO achieves 34.8 CorLoc, yielding highly focused, shape-biased representations that substantially outperform prior dense-video and motion-guided SSL baselines.
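The asymmetric masked-distillation idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the encoder, projection heads, and foreground mask below are toy stand-ins, and the real method operates on learned ViT features with an EMA teacher. The sketch only shows the core asymmetry: the teacher sees a foreground-union view with the background zeroed out, the student sees a context-retaining view, and the loss is computed on foreground patches only.

```python
import numpy as np

rng = np.random.default_rng(0)
P, D = 16, 8                                  # patches per frame, feature dim
patches = rng.normal(size=(P, D))             # toy patch features for one frame
fg_mask = np.zeros(P, dtype=bool)             # class-agnostic structural prior:
fg_mask[3:7] = True                           # patches covering the foreground union

W_student = 0.1 * rng.normal(size=(D, D))     # student projection (trained)
W_teacher = W_student.copy()                  # teacher projection (EMA of student)

def teacher_view(x, m):
    """Foreground-union view: background patches suppressed."""
    return np.where(m[:, None], x, 0.0)

def student_view(x):
    """Object-conditioned scene view: context kept (toy: raw patches)."""
    return x

def masked_distill_loss(Ws, Wt):
    """Match student predictions to teacher targets on foreground patches only."""
    targets = teacher_view(patches, fg_mask) @ Wt   # treated as fixed (no gradient)
    preds = student_view(patches) @ Ws
    diff = (preds - targets)[fg_mask]               # background never enters the loss
    return float(np.mean(diff ** 2))

# When the heads agree, foreground predictions match the targets exactly,
# because both views are identical on foreground patches.
loss_aligned = masked_distill_loss(W_student, W_teacher)

# Perturbing the student creates a nonzero distillation error to minimize.
loss_perturbed = masked_distill_loss(W_student + 0.5, W_teacher)

# Standard EMA teacher update used in teacher-student SSL.
tau = 0.99
W_teacher = tau * W_teacher + (1 - tau) * W_student
```

Because the loss is evaluated only where the two views coincide (the foreground), background features cannot lower it, which is one way to see why this construction makes background cues unreliable as shortcuts.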
Problem

Research questions and friction points this paper aims to address.

self-supervised learning
contextual bias
video representation
object-centric invariance
foreground-background disentanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised learning
video-driven invariance
de-contextualization
structural prior
asymmetric distillation