🤖 AI Summary
This work addresses the limitations of fixed-grid input representations in visual recognition. We propose an image representation paradigm based on learnable 2D Gaussian distributions. Specifically, an image is explicitly modeled as a set of differentiable, parameter-optimized 2D Gaussian kernels; these are rendered into feature maps via differentiable rendering and trained jointly, end-to-end, with a standard Vision Transformer (ViT) classifier. Our key contribution lies in leveraging classification gradients to dynamically guide the spatial distribution of Gaussians toward class-discriminative regions, enabling task-aware co-optimization of representation and recognition. Evaluated on ImageNet-1k, our method achieves 76.9% top-1 accuracy using the ViT-Base architecture, which is competitive with conventional patch-based ViTs. This constitutes the first empirical validation of continuous, learnable, geometry-aware Gaussian representations for mainstream visual recognition, demonstrating both their effectiveness and feasibility.
📝 Abstract
We introduce GVIT, a classification framework that abandons conventional pixel or patch grid input representations in favor of a compact set of learnable 2D Gaussians. Each image is encoded as a few hundred Gaussians whose positions, scales, orientations, colors, and opacities are optimized jointly with a ViT classifier trained on top of these representations. We reuse the classifier gradients as constructive guidance, steering the Gaussians toward class-salient regions while a differentiable renderer optimizes an image reconstruction loss. We demonstrate that 2D Gaussian input representations, coupled with our GVIT guidance and a relatively standard ViT architecture, closely match the performance of a traditional patch-based ViT, reaching 76.9% top-1 accuracy on ImageNet-1k with a ViT-B architecture.
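To make the representation concrete, here is a minimal NumPy sketch of how a set of 2D Gaussians with positions, scales, orientations, colors, and opacities could be rasterized into an image. It is an illustration under stated assumptions, not the GVIT renderer: the function name `render_gaussians`, the pixel-coordinate parameterization, and the simple additive blending with clipping are all choices made here for brevity (the actual method may use a different compositing rule), and this version computes no gradients.

```python
import numpy as np


def render_gaussians(mu, scale, theta, color, opacity, height, width):
    """Rasterize N anisotropic 2D Gaussians into a (height, width, 3) image.

    Hypothetical parameterization (an assumption of this sketch):
      mu:      (N, 2) centers in pixel coordinates (x, y)
      scale:   (N, 2) standard deviations along each Gaussian's local axes
      theta:   (N,)   rotation angles in radians
      color:   (N, 3) RGB values in [0, 1]
      opacity: (N,)   peak weights in [0, 1]
    """
    ys, xs = np.mgrid[0:height, 0:width]
    pix = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float64)  # (P, 2)

    # Per-Gaussian rotation matrices R with shape (N, 2, 2).
    cos, sin = np.cos(theta), np.sin(theta)
    R = np.stack([np.stack([cos, -sin], -1), np.stack([sin, cos], -1)], -2)

    # Offsets from each center, rotated into the Gaussian's local frame (R^T d).
    d = pix[None, :, :] - mu[:, None, :]               # (N, P, 2)
    local = np.einsum("nji,npj->npi", R, d)            # (N, P, 2)

    # Squared Mahalanobis distance and opacity-scaled Gaussian weight.
    m = np.sum((local / scale[:, None, :]) ** 2, axis=-1)  # (N, P)
    w = opacity[:, None] * np.exp(-0.5 * m)                # (N, P)

    # Additive blending of colors (an assumption), clipped to the valid range.
    img = np.einsum("np,nc->pc", w, color)
    return np.clip(img, 0.0, 1.0).reshape(height, width, 3)


# Usage: one red Gaussian, elongated along x, centered in a 9x9 image.
img = render_gaussians(
    mu=np.array([[4.0, 4.0]]),
    scale=np.array([[3.0, 1.0]]),
    theta=np.array([0.0]),
    color=np.array([[1.0, 0.0, 0.0]]),
    opacity=np.array([0.5]),
    height=9,
    width=9,
)
```

In a trainable setting the same computation would be written in an autodiff framework, so that both a reconstruction loss and the classifier loss can backpropagate into `mu`, `scale`, `theta`, `color`, and `opacity`.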