🤖 AI Summary
While Joint-Embedding Predictive Architecture (JEPA) excels in general-purpose representation learning, its dense embeddings suffer from poor interpretability and low computational efficiency.
Method: We propose Sparse-JEPA—the first JEPA variant incorporating structured sparsity regularization and grouped latent variable sharing. Specifically, we enforce group-wise sparsity constraints to encourage semantically related features to share latent variables, thereby enhancing representational disentanglement and compactness without compromising predictive fidelity. We theoretically establish that grouping reduces latent variable multi-information and satisfies the data processing inequality.
Results: Pretraining a lightweight Vision Transformer on CIFAR-100 with Sparse-JEPA yields substantial gains in linear-probe classification accuracy. Moreover, representations exhibit improved generalization to downstream tasks and are more interpretable, object-centric, and semantically disentangled—demonstrating both empirical efficacy and principled design.
📝 Abstract
Joint Embedding Predictive Architectures (JEPA) have emerged as a powerful framework for learning general-purpose representations. However, these models often lack interpretability and suffer from inefficiencies due to dense embedding representations. We propose SparseJEPA, an extension that integrates sparse representation learning into the JEPA framework to enhance the quality of learned representations. SparseJEPA employs a penalty method that encourages latent space variables to be shared among data features with strong semantic relationships, while maintaining predictive performance. We demonstrate the effectiveness of SparseJEPA by training on the CIFAR-100 dataset and pre-training a lightweight Vision Transformer. The improved embeddings are utilized in linear-probe transfer learning for both image classification and low-level tasks, showcasing the architecture's versatility across different transfer tasks. Furthermore, we provide a theoretical proof that demonstrates that the grouping mechanism enhances representation quality. This was done by displaying that grouping reduces Multiinformation among latent-variables, including proofing the Data Processing Inequality for Multiinformation. Our results indicate that incorporating sparsity not only refines the latent space but also facilitates the learning of more meaningful and interpretable representations. In further work, hope to further extend this method by finding new ways to leverage the grouping mechanism through object-centric representation learning.