Harnessing small projectors and multiple views for efficient vision pretraining

📅 2023-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Self-supervised learning (SSL) for visual pretraining suffers from low efficiency and a heavy reliance on empirical hyperparameter tuning. Method: We propose a theory-driven reformulation of the similarity-matching loss: (i) we equivalently reparameterize the idealized SSL loss, uncovering an implicit trade-off between the strength of the orthogonality constraint and the projector dimensionality; (ii) we theoretically show that multiple augmented views can substitute for large-scale data to improve convergence; and (iii) we design a lightweight nonlinear projection head within a multi-view co-optimization framework, informed by the implicit bias of gradient descent. Contributions/Results: Our method achieves significant gains in linear evaluation accuracy on CIFAR, STL, and ImageNet. It maintains downstream task performance using only 50% of the pretraining data, accelerates convergence, and reduces computational overhead, demonstrating improved sample, parameter, and optimization efficiency.
📝 Abstract
Recent progress in self-supervised learning (SSL) for visual representations has led to several proposed frameworks that rely on augmentations of images but use different loss functions. However, there are few theoretically grounded principles to guide practice, so practical implementation of each SSL framework requires several heuristics to achieve competitive performance. In this work, we build on recent analytical results to design practical recommendations for competitive and efficient SSL that are grounded in theory. Specifically, recent theory tells us that existing SSL frameworks are minimizing the same idealized loss, which is to learn features that best match the data similarity kernel defined by the augmentations used. We show how this idealized loss can be reformulated to a functionally equivalent loss that is more efficient to compute. We study the implicit bias of using gradient descent to minimize our reformulated loss function and find that using a stronger orthogonalization constraint with a reduced projector dimensionality should yield good representations. Furthermore, the theory tells us that approximating the reformulated loss should be improved by increasing the number of augmentations, and as such using multiple augmentations should lead to improved convergence. We empirically verify our findings on CIFAR, STL, and ImageNet datasets, wherein we demonstrate improved linear readout performance when training a ResNet backbone using our theoretically grounded recommendations. Remarkably, we also demonstrate that by leveraging these insights, we can reduce the pretraining dataset size by up to 2$\times$ while maintaining downstream accuracy simply by using more data augmentations. Taken together, our work provides theoretically grounded recommendations that can be used to improve SSL convergence and efficiency.
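To make the abstract's recipe concrete, the sketch below illustrates the general shape of a multi-view similarity-matching SSL objective: an alignment term that pulls each view toward the mean feature of its image's augmentations, plus a soft orthogonality term that pushes the feature covariance toward (a scaled) identity. This is a minimal, hypothetical NumPy illustration written for this summary, not the paper's actual loss or implementation; the function name, the specific term weighting, and the equal split between the two terms are all assumptions.

```python
import numpy as np

def multiview_similarity_loss(z, n_images, n_views):
    """Hypothetical sketch of a multi-view similarity-matching SSL loss.

    z: array of shape (n_images * n_views, d) -- projected features,
       with the views of each image stored as consecutive rows.
    Combines an alignment term (views close to their image's mean view)
    with a soft orthogonality term (feature covariance close to I/d),
    echoing the trade-off between orthogonalization strength and
    projector dimensionality discussed in the abstract.
    """
    d = z.shape[1]
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-norm features
    views = z.reshape(n_images, n_views, d)
    centroids = views.mean(axis=1)                     # per-image mean over views
    # Alignment: squared distance of each view to its image's centroid.
    align = ((views - centroids[:, None, :]) ** 2).sum(axis=2).mean()
    # Orthogonality: penalize deviation of the covariance from I/d.
    cov = centroids.T @ centroids / n_images
    ortho = ((cov - np.eye(d) / d) ** 2).sum()
    return align + ortho
```

Increasing `n_views` at fixed `n_images` gives a better estimate of each image's centroid, which is one intuition behind the abstract's claim that more augmentations can substitute for more pretraining data.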
Problem

Research questions and friction points this paper is trying to address.

Self-Supervised Learning
Visual Learning
Image Transformation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simplified Feature Calculation
Gradient Descent Optimization
Enhanced Data Transformation