🤖 AI Summary
Capsule Networks (CapsNets) have long struggled to scale to complex vision tasks, owing to their reliance on small-scale datasets and shallow architectures. To address this, the authors introduce the first self-supervised pretraining paradigm tailored to capsule representations: the Masked Capsule Autoencoder (MCAE). The method randomly masks image patches and reconstructs the original content via a novel capsule-based decoder, enabling structure-aware representation learning. Supervised fine-tuning then adapts the pretrained model to downstream tasks. This approach loosens CapsNets' traditional dependence on strong supervision, improving scalability and generalisation. On Imagenette (a 10-class subset of ImageNet at ImageNet-scale resolution), MCAE achieves state-of-the-art performance among Capsule Networks, improving top-1 accuracy by 9% over the baseline. The results empirically validate the effectiveness and broad applicability of self-supervised pretraining for capsule networks.
📝 Abstract
We propose Masked Capsule Autoencoders (MCAE), the first Capsule Network that utilises pretraining in a modern self-supervised paradigm, specifically the masked image modelling framework. Capsule Networks have emerged as a powerful alternative to Convolutional Neural Networks (CNNs) and have shown favourable properties when compared to Vision Transformers (ViTs), but have struggled to learn effectively when presented with more complex data, leading to Capsule Network models that do not scale to modern tasks. Our proposed MCAE model alleviates this issue by reformulating the Capsule Network to use masked image modelling as a pretraining stage before finetuning in a supervised manner. Across several experiments and ablation studies we demonstrate that, like CNNs and ViTs, Capsule Networks can also benefit from self-supervised pretraining, paving the way for further advancements in this neural network domain. For instance, by pretraining on the Imagenette dataset, which consists of 10 classes of Imagenet-sized images, we achieve state-of-the-art results for Capsule Networks, with a 9% improvement over our baseline model. We therefore propose that Capsule Networks benefit from, and should be trained within, a masked image modelling framework, using a novel capsule decoder, to enhance performance on realistically sized images.
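The masked image modelling stage the abstract describes can be sketched in a few lines: split the image into patches, hide a random subset, and score the decoder's reconstruction of the originals. This is a minimal NumPy sketch of that generic pipeline, not the paper's actual MCAE implementation; the function names (`patchify`, `random_mask`), the 4-pixel patch size, and the 75% mask ratio are illustrative assumptions.

```python
import numpy as np

def patchify(img, patch=4):
    # Split an (H, W, C) image into flat (N, patch*patch*C) patches.
    h, w, c = img.shape
    rows, cols = h // patch, w // patch
    return (img[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, c)
            .transpose(0, 2, 1, 3, 4)
            .reshape(rows * cols, patch * patch * c))

def random_mask(patches, mask_ratio=0.75, seed=0):
    # Keep a random subset of patches; the rest are hidden from the encoder.
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep, masked = perm[:n_keep], perm[n_keep:]
    return patches[keep], keep, masked

def reconstruction_loss(pred, target):
    # MSE between decoder output and the original patches (the MIM objective).
    return float(np.mean((pred - target) ** 2))
```

In a full pipeline, only the visible patches would be fed through the capsule encoder, and a decoder (capsule-based in MCAE) would predict the masked patches, with `reconstruction_loss` driving pretraining before the supervised finetuning stage.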