🤖 AI Summary
Joint image alignment (JA) faces challenges including high computational complexity, difficulty in modeling geometric distortions, and susceptibility to poor optima. Vision Transformer (ViT) features have helped, but existing approaches still depend on expensive models, numerous regularization terms, and atlas maintenance, which leads to long training times and difficult hyperparameter tuning. This paper proposes SpaceJAM, a lightweight JA framework that requires neither explicit regularization nor atlas maintenance. The method uses a compact spatial modeling architecture, without conventional regularizers, to predict the alignment transformations directly. With only 16K trainable parameters, the model matches the alignment accuracy of existing methods on SPair-71K and CUB while running at least 10× faster and significantly reducing computational demands. The core contribution is a demonstration that accurate, efficient unsupervised joint alignment is achievable with an extremely minimal architectural design.
📝 Abstract
The unsupervised task of Joint Alignment (JA) of images is beset by challenges such as high complexity, geometric distortions, and convergence to poor local or even global optima. Although Vision Transformers (ViTs) have recently provided valuable features for JA, they fall short of fully addressing these issues. Consequently, researchers frequently depend on expensive models and numerous regularization terms, resulting in long training times and challenging hyperparameter tuning. We introduce the Spatial Joint Alignment Model (SpaceJAM), a novel approach that addresses the JA task with efficiency and simplicity. SpaceJAM leverages a compact architecture with only 16K trainable parameters and uniquely operates without the need for regularization or atlas maintenance. Evaluations on the SPair-71K and CUB datasets demonstrate that SpaceJAM matches the alignment capabilities of existing methods while significantly reducing computational demands and achieving at least a 10x speedup. SpaceJAM sets a new standard for rapid and effective image alignment, making the process more accessible and efficient. Our code is available at: https://bgu-cs-vil.github.io/SpaceJAM/.