GViT: Representing Images as Gaussians for Visual Recognition

📅 2025-06-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of fixed-grid input representations in visual recognition. We propose a novel image representation paradigm based on learnable 2D Gaussian distributions. Specifically, an image is explicitly modeled as a set of differentiable, parameter-optimized 2D Gaussian kernels; these are rendered into feature maps via differentiable rendering and jointly trained end-to-end with a standard Vision Transformer (ViT) classifier. Our key contribution lies in leveraging classification gradients to dynamically guide the spatial distribution of Gaussians toward class-discriminative regions, enabling task-aware co-optimization of representation and recognition. Evaluated on ImageNet-1k, our method achieves 76.9% top-1 accuracy using the ViT-Base architecture, which is competitive with conventional patch-based ViTs. This constitutes the first empirical validation of continuous, learnable, geometry-aware Gaussian representations for mainstream visual recognition, demonstrating both their effectiveness and feasibility.

📝 Abstract

We introduce GViT, a classification framework that abandons conventional pixel or patch grid input representations in favor of a compact set of learnable 2D Gaussians. Each image is encoded as a few hundred Gaussians whose positions, scales, orientations, colors, and opacities are optimized jointly with a ViT classifier trained on top of these representations. We reuse the classifier gradients as constructive guidance, steering the Gaussians toward class-salient regions while a differentiable renderer optimizes an image reconstruction loss. We demonstrate that 2D Gaussian input representations, coupled with our GViT guidance and a relatively standard ViT architecture, closely match the performance of a traditional patch-based ViT, reaching 76.9% top-1 accuracy on ImageNet-1k with a ViT-B architecture.
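To make the parameterization concrete, here is a minimal NumPy sketch of how a set of 2D Gaussians with positions, scales, orientations, colors, and opacities could be rasterized into an image. It is an illustration only: the function name `render_gaussians`, the argument layout, and the simple additive blending are assumptions made for this example, not the renderer used in the paper, and NumPy itself provides no gradients (a differentiable version would use an autodiff framework).

```python
import numpy as np

def render_gaussians(means, scales, angles, colors, opacities, height, width):
    """Additively splat N anisotropic 2D Gaussians onto an H x W RGB canvas.

    Hypothetical argument layout (an assumption for this sketch):
      means:     (N, 2) centres in pixel coordinates, ordered (x, y)
      scales:    (N, 2) standard deviations along each principal axis
      angles:    (N,)   rotation of the principal axes, in radians
      colors:    (N, 3) RGB values in [0, 1]
      opacities: (N,)   peak weight of each Gaussian in [0, 1]
    """
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)  # (H, W, 2), (x, y) order
    image = np.zeros((height, width, 3))

    for mu, s, theta, c, a in zip(means, scales, angles, colors, opacities):
        # Build the covariance from rotation and per-axis scale: R diag(s^2) R^T.
        cos_t, sin_t = np.cos(theta), np.sin(theta)
        rot = np.array([[cos_t, -sin_t], [sin_t, cos_t]])
        cov = rot @ np.diag(np.asarray(s, dtype=np.float64) ** 2) @ rot.T
        inv_cov = np.linalg.inv(cov)

        # Squared Mahalanobis distance of every pixel to the Gaussian centre.
        d = grid - np.asarray(mu, dtype=np.float64)
        maha = np.einsum("hwi,ij,hwj->hw", d, inv_cov, d)

        # Unnormalised Gaussian falloff, scaled by opacity and tinted by colour.
        weight = a * np.exp(-0.5 * maha)
        image += weight[..., None] * np.asarray(c, dtype=np.float64)

    # Additive blending can exceed 1, so clamp to the valid colour range.
    return np.clip(image, 0.0, 1.0)
```

Additive blending is chosen here only because it is short; a real splatting renderer would more likely use ordered alpha compositing. Every operation is smooth in the Gaussian parameters, which is the property that lets a loss on the rendered image update positions, scales, and orientations.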

Problem

Research questions and friction points this paper is trying to address.

Replacing pixel grids with learnable 2D Gaussians for image representation

Optimizing Gaussian parameters jointly with ViT classifier for visual recognition

Achieving competitive accuracy using Gaussian-based inputs instead of patches

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses learnable 2D Gaussians for image representation

Optimizes Gaussian parameters with ViT classifier

Employs differentiable renderer for reconstruction guidance
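One way to picture how reconstruction and classification signals might be combined is a weighted sum of two losses, sketched below. This summary does not state the paper's actual loss terms or weighting, so the mean-squared-error reconstruction term, the cross-entropy term, and the `recon_weight` parameter are all assumptions chosen for illustration.

```python
import numpy as np

def joint_loss(rendered, target, logits, label, recon_weight=1.0):
    """Hypothetical combined objective: cross-entropy plus weighted reconstruction MSE.

    rendered, target: arrays of the same shape (rendered Gaussians vs. source image)
    logits:           (C,) unnormalised class scores from the classifier
    label:            integer index of the true class
    recon_weight:     assumed trade-off coefficient, not taken from the paper
    """
    # Reconstruction term: how closely the rendered Gaussians match the image.
    recon = np.mean((np.asarray(rendered) - np.asarray(target)) ** 2)

    # Classification term: numerically stable log-softmax cross-entropy.
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    cross_entropy = -log_probs[label]

    return cross_entropy + recon_weight * recon
```

In an autodiff framework, gradients of such an objective would flow both into the classifier weights and back through the renderer into the Gaussian parameters, which is the sense in which classification can steer where the Gaussians are placed.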
