🤖 AI Summary
Existing self-supervised learning methods predominantly rely on a two-view paradigm (e.g., data augmentation or masking), which creates an inherent trade-off between invariance-related tasks (e.g., image classification) and equivariance-related tasks (e.g., fine-grained localization) and thereby limits downstream adaptability. To address this, we propose seq-JEPA, a joint modeling framework that processes a sequence of multi-view observations together with action embeddings, so that a single model learns invariant and equivariant representations simultaneously, the first framework of its kind. Its core is the architectural segregation of the two representations, which requires no auxiliary prediction heads or additional loss terms. Built on the joint-embedding predictive architecture (JEPA), seq-JEPA uses an action-conditioned Transformer to realize sequence-aware world modeling. Evaluated on equivariance benchmarks, image classification, path integration, and predictive learning across eye movements, seq-JEPA achieves strong performance on both invariance- and equivariance-related tasks without sacrificing one for the other.
📝 Abstract
Current self-supervised algorithms mostly rely on transformations such as data augmentation and masking to learn visual representations. This is achieved by inducing invariance or equivariance with respect to these transformations after encoding two views of an image. This dominant two-view paradigm can limit the flexibility of learned representations for downstream adaptation by creating performance trade-offs between invariance-related tasks such as image classification and more fine-grained equivariance-related tasks. In this work, we introduce seq-JEPA, a world modeling paradigm based on the joint-embedding predictive architecture that leverages architectural inductive biases to resolve this trade-off. Without requiring an additional equivariance predictor or loss term, seq-JEPA simultaneously learns two architecturally segregated representations: one equivariant to the specified transformations and another invariant to them and suited for tasks such as classification. To do so, our model processes a short sequence of different views (observations) of an input image. Each encoded view is concatenated with an embedding of the relative transformation (action) that produces the next observation in the sequence. A transformer encoder outputs an aggregate representation of this sequence, which is subsequently conditioned on the action leading to the next observation to predict its representation. Empirically, seq-JEPA achieves strong performance on equivariance benchmarks and image classification without sacrificing one for the other. Additionally, our framework excels at tasks that inherently require aggregating a sequence of observations, such as path integration across actions and predictive learning across eye movements.
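The forward pass described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the linear "encoder", the single attention layer standing in for the transformer, the learnable aggregate token, the zero action padded onto the last context view, the mean-squared-error objective, and all names and dimensions (`seq_jepa_forward`, `D_Z`, `W_pred`, and so on) are hypothetical choices made for illustration. The abstract does not spell out which output serves as the equivariant representation and which as the invariant one, so the sketch simply exposes both the per-view encodings and the sequence aggregate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: flattened view, raw action, view embedding, action embedding.
D_IMG, D_ACT, D_Z, D_A = 64, 4, 32, 8
D_TOK = D_Z + D_A  # one token = encoded view concatenated with an action embedding


def linear(d_in, d_out):
    """Random weight matrix standing in for a trained layer."""
    return rng.normal(0.0, d_in ** -0.5, size=(d_in, d_out))


W_enc = linear(D_IMG, D_Z)            # stand-in for the view encoder
W_act = linear(D_ACT, D_A)            # action embedding
W_q, W_k, W_v = (linear(D_TOK, D_TOK) for _ in range(3))
agg_token = rng.normal(size=(1, D_TOK))  # aggregate token (assumption)
W_pred = linear(D_TOK + D_A, D_Z)     # action-conditioned predictor


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate(tokens):
    """One self-attention layer; returns the output at the aggregate token."""
    x = np.concatenate([agg_token, tokens], axis=0)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(q @ k.T / np.sqrt(D_TOK))
    return (attn @ v)[0]


def seq_jepa_forward(views, actions):
    """views: (T+1, D_IMG); actions: (T, D_ACT), actions[t] maps view t to view t+1."""
    z = np.tanh(views @ W_enc)        # per-view representations, (T+1, D_Z)
    a = actions @ W_act               # action embeddings, (T, D_A)
    context_z, target_z = z[:-1], z[-1]
    # Pair each context view with the action producing the next context view.
    # The last context view gets a zero action (assumption); its real action
    # is reserved for conditioning the predictor.
    a_ctx = np.concatenate([a[:-1], np.zeros((1, D_A))], axis=0)
    tokens = np.concatenate([context_z, a_ctx], axis=1)   # (T, D_TOK)
    agg = aggregate(tokens)                                # sequence-level representation
    pred = np.concatenate([agg, a[-1]]) @ W_pred           # predicted target embedding
    # In training, the target would normally be treated as a constant (assumption).
    loss = float(np.mean((pred - target_z) ** 2))
    return pred, target_z, loss


# Example: three context views plus one target view, with three relative actions.
views = rng.normal(size=(4, D_IMG))
actions = rng.normal(size=(3, D_ACT))
pred, target, loss = seq_jepa_forward(views, actions)
print(pred.shape, target.shape, loss)
```

The point of the sketch is the data flow: action information enters both at the token level (paired with each encoded view) and at the predictor, while the prediction target lives in representation space rather than pixel space, as in other joint-embedding predictive architectures.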