Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

📅 2026-04-12

🤖 AI Summary

This work addresses weak geometry induction, limited appearance detail, and geometry-semantics inconsistency when learning 3D representations from unposed multi-view images. It proposes UniSplat, a feed-forward framework for unified 3D representation learning. The method introduces a dual-masking strategy to strengthen geometry induction in the encoder, employs a coarse-to-fine Gaussian splatting strategy to refine appearance reconstruction, and adds a pose-conditioned recalibration mechanism that re-projects predicted 3D point and semantic maps into the image plane to align geometric and semantic predictions. Built on a Transformer encoder and trained with self-supervision, UniSplat aims to produce geometrically accurate, visually detailed, and semantically consistent 3D representations from sparse, unposed inputs that generalize across downstream spatial intelligence and embodied AI tasks.

📝 Abstract

Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, yielding geometry-aware representations even from unposed inputs. Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field. Finally, to enforce geometric-semantic consistency, we introduce a pose-conditioned recalibration mechanism that interrelates the outputs of multiple heads. It re-projects predicted 3D point and semantic maps into the image plane using estimated camera parameters and aligns them with the corresponding RGB and semantic predictions, ensuring cross-task consistency and resolving geometry-semantic mismatches. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.
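The re-projection step described in the abstract can be illustrated with a small pinhole-camera sketch. This is not the paper's implementation: the function names (`project_point`, `reprojection_mismatch`), the toy camera, and the use of a discrete label mismatch rate in place of a trainable alignment loss are all assumptions made for illustration. It only shows the general idea of mapping predicted 3D points into the image plane with estimated camera parameters and comparing their labels with a 2D semantic prediction.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def project_point(p_world: Vec3, K: Mat3, R: Mat3, t: Vec3) -> Tuple[float, float]:
    """Project one world-space point to pixel coordinates (pinhole model)."""
    # World -> camera frame: X_cam = R @ X_world + t
    cam = [sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Camera -> homogeneous pixels: x = K @ X_cam, then divide by the third coordinate
    hom = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    return hom[0] / hom[2], hom[1] / hom[2]


def reprojection_mismatch(points, labels_3d, semantic_map, K, R, t) -> float:
    """Fraction of visible reprojected points whose 3D label differs from the 2D map.

    Hypothetical stand-in for a cross-task consistency term; a real training
    objective would use a differentiable loss rather than a hard mismatch count.
    """
    height, width = len(semantic_map), len(semantic_map[0])
    visible = mismatches = 0
    for point, label in zip(points, labels_3d):
        try:
            u, v = project_point(point, K, R, t)
        except ValueError:
            continue  # skip points behind the camera
        col, row = int(round(u)), int(round(v))
        if 0 <= row < height and 0 <= col < width:
            visible += 1
            mismatches += int(semantic_map[row][col] != label)
    return mismatches / visible if visible else 0.0


# Toy setup (all values are illustrative assumptions).
K = [[100, 0, 2], [0, 100, 2], [0, 0, 1]]   # estimated intrinsics
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]       # estimated rotation (identity)
t = (0.0, 0.0, 0.0)                         # estimated translation
points = [(0.0, 0.0, 1.0), (0.01, 0.0, 1.0), (0.0, 0.0, -1.0)]
labels_3d = [0, 0, 0]
semantic_map = [[0, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 0]]

rate = reprojection_mismatch(points, labels_3d, semantic_map, K, R, t)
```

In this toy setup the first point lands on a pixel whose 2D label agrees with its 3D label, the second lands on a pixel labeled differently, and the third is skipped because it lies behind the camera. A nonzero mismatch rate of this kind is the sort of geometry-semantics disagreement that a recalibration objective would be expected to penalize.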

Problem

Research questions and friction points this paper is trying to address.

3D representation learning

unposed multi-view images

spatial intelligence

geometry-semantics consistency

self-supervised learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

UniSplat

Gaussian splatting

geometry-semantics consistency