Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

📅 2026-04-30
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing visual world models lose geometric and semantic information at Gaussian bottlenecks, which makes it hard for compressed representations to preserve 3D structure and physically consistent camera dynamics. This work proposes S²VAE, a variational autoencoder whose bottleneck uses a product of Power Spherical distributions in place of conventional Gaussian latent variables, so that the topology of the hyperspherical latent space is explicitly aligned with 3D geometric structure. Following a geometry-first principle, the model builds on VGGT features and compresses the latent 3D state of a scene (depth, camera pose, and point-level structure) rather than modeling appearance alone. Across depth estimation, camera pose recovery, and point cloud reconstruction, S²VAE consistently outperforms conventional Gaussian bottlenecks, particularly at high compression rates.
📝 Abstract
Modern visual world modeling systems increasingly rely on high-capacity architectures and large-scale data to produce plausible motion, yet they often fail to preserve underlying 3D geometry or physically consistent camera dynamics. A key limitation lies not only in model capacity, but in the latent representations used to encode geometric structure. We propose S$^2$VAE, a geometry-first latent learning framework that focuses on compressing and representing the latent 3D state of a scene, including camera motion, depth, and point-level structure, rather than modeling appearance alone. Building on representations from a Visual Geometry Grounded Transformer (VGGT), we introduce a novel type of variational autoencoder using a product of Power Spherical latent distributions, explicitly enforcing hyperspherical structure in the bottleneck to preserve directional and geometric semantics under strong compression. Across depth estimation, camera pose recovery, and point cloud reconstruction, we show that geometry-aligned hyperspherical latents consistently outperform conventional Gaussian bottlenecks, particularly in high-compression regimes. Our results highlight latent geometry as a first-class design choice for physically grounded visual and world models.
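To make the "product of Power Spherical latent distributions" concrete, the following is a minimal NumPy sketch of how such a latent could be sampled. It is not the paper's implementation. It assumes the usual Power Spherical sampler (a Beta-distributed component along the mean direction, a uniform tangent direction, then a Householder reflection onto the mean), and it assumes the latent is split into independent blocks that each live on their own unit sphere. The function names, the block count, and the block dimension are hypothetical choices for illustration.

```python
import numpy as np


def sample_power_spherical(mu, kappa, rng):
    """Draw one sample from PowerSpherical(mu, kappa) on the unit sphere in R^d (d >= 2)."""
    d = mu.shape[-1]
    # Component along the mean direction: t = 2z - 1, z ~ Beta((d-1)/2 + kappa, (d-1)/2).
    z = rng.beta((d - 1) / 2 + kappa, (d - 1) / 2)
    t = 2.0 * z - 1.0
    # Uniform direction on the (d-2)-sphere for the remaining coordinates.
    v = rng.standard_normal(d - 1)
    v /= np.linalg.norm(v)
    y = np.concatenate(([t], np.sqrt(max(1.0 - t * t, 0.0)) * v))
    # Householder reflection that maps the first basis vector e1 onto mu.
    e1 = np.zeros(d)
    e1[0] = 1.0
    u = e1 - mu
    norm = np.linalg.norm(u)
    if norm < 1e-12:  # mu already equals e1, no reflection needed
        return y
    u /= norm
    return y - 2.0 * np.dot(y, u) * u


def sample_product_power_spherical(mus, kappas, rng):
    """Product-form latent (assumed layout): one independent draw per block, concatenated."""
    parts = [sample_power_spherical(m, k, rng) for m, k in zip(mus, kappas)]
    return np.concatenate(parts)


# Hypothetical usage: 4 blocks, each on the unit sphere in R^8.
rng = np.random.default_rng(0)
mus = rng.standard_normal((4, 8))
mus /= np.linalg.norm(mus, axis=1, keepdims=True)  # mean directions must be unit vectors
kappas = np.full(4, 50.0)                          # larger kappa concentrates samples near mu
z = sample_product_power_spherical(mus, kappas, rng)
print(z.shape)  # (32,)
```

In a VAE, an encoder would predict each block's mean direction and concentration. Every block of the sampled latent has unit norm, which is the hyperspherical structure that a Gaussian bottleneck does not enforce.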
Problem

Research questions and friction points this paper is trying to address.

3D geometry
latent representation
camera dynamics
geometric structure
visual world modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

hyperspherical latent
geometry-aware representation
Power Spherical VAE
Visual Geometry Grounded Transformer
non-Gaussian bottleneck