Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Visual world models often lose 3D structure and physically consistent camera dynamics when scene geometry is compressed through a Gaussian bottleneck. This work proposes S²VAE, a variational autoencoder built on Visual Geometry Grounded Transformer (VGGT) features whose bottleneck uses a product of Power Spherical distributions, replacing Gaussian latent variables with hyperspherical ones so that directional and geometric semantics survive strong compression. Following a geometry-first principle, the model compresses the latent 3D state of a scene (camera motion, depth, and point-level structure) rather than appearance alone. Across depth estimation, camera pose recovery, and point cloud reconstruction, the hyperspherical latents consistently outperform conventional Gaussian bottlenecks, particularly at high compression rates.

📝 Abstract

Modern visual world modeling systems increasingly rely on high-capacity architectures and large-scale data to produce plausible motion, yet they often fail to preserve underlying 3D geometry or physically consistent camera dynamics. A key limitation lies not only in model capacity, but in the latent representations used to encode geometric structure. We propose S$^2$VAE, a geometry-first latent learning framework that focuses on compressing and representing the latent 3D state of a scene, including camera motion, depth, and point-level structure, rather than modeling appearance alone. Building on representations from a Visual Geometry Grounded Transformer (VGGT), we introduce a novel type of variational autoencoder using a product of Power Spherical latent distributions, explicitly enforcing hyperspherical structure in the bottleneck to preserve directional and geometric semantics under strong compression. Across depth estimation, camera pose recovery, and point cloud reconstruction, we show that geometry-aligned hyperspherical latents consistently outperform conventional Gaussian bottlenecks, particularly in high-compression regimes. Our results highlight latent geometry as a first-class design choice for physically grounded visual and world models.
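
To make the "product of Power Spherical latent distributions" concrete, the following is a minimal NumPy sketch of how one sample from such a bottleneck could be drawn. It is not the authors' implementation. It uses the standard sampling construction for a single Power Spherical distribution (a Beta-distributed component along the mean direction, a uniform tangent direction, and a Householder reflection onto the mean) and assumes the product form means K independent factors, each on its own unit sphere. The function names, the number of factors, and the per-factor dimension are all hypothetical.

```python
import numpy as np

def sample_power_spherical(mu, kappa, rng):
    """Draw one sample from a Power Spherical distribution on the unit sphere.

    mu    : unit vector of shape (d,), d >= 2 (mean direction)
    kappa : non-negative concentration
    rng   : numpy.random.Generator
    """
    d = mu.shape[0]
    # Component along the mean direction: t = 2z - 1 with z ~ Beta(a, b).
    z = rng.beta((d - 1) / 2.0 + kappa, (d - 1) / 2.0)
    t = 2.0 * z - 1.0
    # Uniform direction on the (d-2)-sphere for the remaining components.
    v = rng.standard_normal(d - 1)
    v /= np.linalg.norm(v)
    y = np.concatenate(([t], np.sqrt(max(1.0 - t * t, 0.0)) * v))
    # Householder reflection that maps the first basis vector e1 onto mu.
    e1 = np.zeros(d)
    e1[0] = 1.0
    u = e1 - mu
    norm_u = np.linalg.norm(u)
    if norm_u < 1e-12:  # mu already equals e1, no reflection needed
        return y
    u /= norm_u
    return y - 2.0 * np.dot(u, y) * u

def sample_product_latent(mus, kappas, rng):
    """Sample K independent hyperspherical factors (assumed product form).

    mus    : array of shape (K, d), each row a unit vector
    kappas : array of shape (K,)
    Returns an array of shape (K, d) whose rows all have unit norm.
    """
    return np.stack([sample_power_spherical(m, k, rng)
                     for m, k in zip(mus, kappas)])
```

In a VAE, an encoder would predict each `mu` (normalized to unit length) and each `kappa`. Larger `kappa` concentrates samples around `mu`, while `kappa = 0` gives the uniform distribution on the sphere.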

Problem

Research questions and friction points this paper is trying to address.

3D geometry

latent representation

camera dynamics

geometric structure

visual world modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

hyperspherical latent

geometry-aware representation

Power Spherical VAE