GViT: Representing Images as Gaussians for Visual Recognition

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of fixed-grid input representations in visual recognition. We propose a novel image representation paradigm based on learnable 2D Gaussian distributions. Specifically, an image is explicitly modeled as a set of differentiable, parameter-optimized 2D Gaussian kernels; these are rendered into feature maps via differentiable rendering and jointly trained end-to-end with a standard Vision Transformer (ViT) classifier. Our key contribution lies in leveraging classification gradients to dynamically guide the spatial distribution of Gaussians toward class-discriminative regions, enabling task-aware co-optimization of representation and recognition. Evaluated on ImageNet-1k, our method achieves 76.9% top-1 accuracy using the ViT-Base architecture, which is competitive with conventional patch-based ViTs. This constitutes the first empirical validation of continuous, learnable, geometry-aware Gaussian representations for mainstream visual recognition, demonstrating both their effectiveness and feasibility.

📝 Abstract
We introduce GViT, a classification framework that abandons conventional pixel or patch grid input representations in favor of a compact set of learnable 2D Gaussians. Each image is encoded as a few hundred Gaussians whose positions, scales, orientations, colors, and opacities are optimized jointly with a ViT classifier trained on top of these representations. We reuse the classifier gradients as constructive guidance, steering the Gaussians toward class-salient regions while a differentiable renderer optimizes an image reconstruction loss. We demonstrate that 2D Gaussian input representations, coupled with our GViT guidance and a relatively standard ViT architecture, closely match the performance of a traditional patch-based ViT, reaching 76.9% top-1 accuracy on ImageNet-1k with a ViT-B architecture.
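To make the parameterization above concrete, here is a minimal sketch of how a set of 2D Gaussians with positions, scales, orientations, colors, and opacities could be rasterized into an image. It is not the paper's renderer: the `Gaussian2D` class, the `weight` and `render` functions, and the simple additive blending with clamping are all assumptions chosen for illustration, and the loops are written in plain Python rather than a differentiable framework.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian2D:
    # Hypothetical container for the per-Gaussian parameters named in the abstract.
    x: float        # center, in pixel coordinates
    y: float
    sx: float       # standard deviations along the two principal axes
    sy: float
    theta: float    # orientation in radians
    color: tuple    # (r, g, b), each in [0, 1]
    opacity: float  # in [0, 1]


def weight(g, px, py):
    """Unnormalized Gaussian falloff of g at pixel (px, py)."""
    dx, dy = px - g.x, py - g.y
    c, s = math.cos(g.theta), math.sin(g.theta)
    # Rotate the offset into the Gaussian's own axes.
    u = c * dx + s * dy
    v = -s * dx + c * dy
    return math.exp(-0.5 * ((u / g.sx) ** 2 + (v / g.sy) ** 2))


def render(gaussians, width, height):
    """Additively blend opacity-weighted colors, clamping each channel to 1.

    Additive blending is an assumption made for simplicity here.
    """
    image = [[[0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]
    for g in gaussians:
        for py in range(height):
            for px in range(width):
                a = g.opacity * weight(g, px, py)
                for ch in range(3):
                    image[py][px][ch] = min(1.0, image[py][px][ch] + a * g.color[ch])
    return image


# A single red Gaussian centered on a 5x5 canvas.
g = Gaussian2D(x=2.0, y=2.0, sx=1.0, sy=1.0, theta=0.0,
               color=(1.0, 0.0, 0.0), opacity=0.5)
img = render([g], 5, 5)
```

Because every pixel value is a smooth function of the Gaussian parameters, the same computation written in an autodiff framework would let a reconstruction loss update positions, scales, orientations, colors, and opacities by gradient descent.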
Problem

Research questions and friction points this paper is trying to address.

Replacing pixel grids with learnable 2D Gaussians for image representation
Optimizing Gaussian parameters jointly with ViT classifier for visual recognition
Achieving competitive accuracy using Gaussian-based inputs instead of patches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses learnable 2D Gaussians for image representation
Optimizes Gaussian parameters with ViT classifier
Employs differentiable renderer for reconstruction guidance
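One way to picture the joint optimization listed above is as a single objective with a classification term and a reconstruction term. The sketch below is an assumption, not the paper's formulation: the source does not state how the two signals are combined, so the weighted sum, the `recon_weight` parameter, and the choice of cross-entropy and mean squared error are illustrative only.

```python
import math


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def mse(rendered, target):
    """Mean squared error between two flat lists of pixel values."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(rendered)


def joint_loss(logits, label, rendered, target, recon_weight=1.0):
    # Hypothetical combination: classification loss plus weighted reconstruction loss.
    return cross_entropy(logits, label) + recon_weight * mse(rendered, target)


# Toy values: two-class logits and a two-pixel "image".
loss = joint_loss([0.0, 0.0], 0, [0.0, 1.0], [0.0, 0.0], recon_weight=2.0)
```

In a setup like this, gradients of the classification term with respect to the Gaussian parameters would pull Gaussians toward regions that help the classifier, while the reconstruction term keeps the rendered image close to the original.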
Jefferson Hernandez
Rice University
Ruozhen He
Rice University
Guha Balakrishnan
Assistant Professor, Rice University
Computer vision, medical imaging
Alexander C. Berg
University of California, Irvine
Vicente Ordonez
Rice University