E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses geometric uninterpretability and shortcut learning in unsupervised 3D representation learning from multi-view images, problems that stem from implicit neural representations. The authors propose the first end-to-end, explicitly geometry-driven self-supervised 3D reconstruction framework. Methodologically, they adopt explicit 3D representations—such as voxels and point clouds—to model 3D space directly; enforce multi-view geometric consistency as a self-supervisory signal; and design a fine-grained unsupervised curriculum learning strategy that harmonizes training across heterogeneous multi-source data and enables cross-domain alignment. The key contributions are: (1) establishing the first explicit 3D-space self-supervised reconstruction paradigm, eliminating inherent limitations of implicit representations; (2) achieving state-of-the-art pose estimation accuracy—significantly surpassing RayZer—and matching or exceeding the fully supervised VGGT in 3D reconstruction fidelity; and (3) consistently outperforming leading vision foundation models—including DINOv3 and CroCo v2—when transferred to diverse 3D downstream tasks.
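The multi-view geometric consistency signal mentioned above can be illustrated with a minimal photometric reprojection check: unproject one view's pixels using a predicted depth map, transform them into a second view with the predicted relative pose, and compare colors at the reprojected locations. The function below is a hypothetical sketch of this idea (the paper's actual loss and representations are not specified here); all names and the nearest-neighbour sampling are assumptions for illustration.

```python
import numpy as np

def reprojection_consistency_loss(depth_a, img_a, img_b, K, T_ab):
    """Hypothetical multi-view photometric consistency sketch:
    unproject view A's pixels with predicted depth, move them into
    view B's frame with relative pose T_ab (4x4), reproject with
    intrinsics K (3x3), and compare colors via nearest-neighbour
    sampling. Lower loss = more geometrically consistent prediction."""
    H, W, C = img_a.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3).astype(float)
    # Unproject: X = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts_a = rays * depth_a.reshape(-1, 1)
    # Rigid transform into view B (homogeneous coordinates)
    pts_h = np.concatenate([pts_a, np.ones((pts_a.shape[0], 1))], axis=1)
    pts_b = (pts_h @ T_ab.T)[:, :3]
    # Perspective projection into view B's image plane
    proj = pts_b @ K.T
    u = proj[:, 0] / proj[:, 2]
    v = proj[:, 1] / proj[:, 2]
    # Keep points in front of the camera and inside the image
    valid = (pts_b[:, 2] > 0) & (u >= 0) & (u <= W - 1) & (v >= 0) & (v <= H - 1)
    ui = np.round(u[valid]).astype(int)
    vi = np.round(v[valid]).astype(int)
    diff = np.abs(img_b[vi, ui] - img_a.reshape(-1, C)[valid])
    return diff.mean()
```

With an identity relative pose and identical images, every pixel reprojects onto itself and the loss is zero; any pose or depth error increases the photometric residual, which is what makes the signal usable for self-supervision without pose labels.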

📝 Abstract
Self-supervised pre-training has revolutionized foundation models for language, individual 2D images, and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation, and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.
Problem

Research questions and friction points this paper is trying to address.

Self-supervised 3D reconstruction from unlabeled multi-view images
Learning geometrically grounded 3D-aware visual representations
Eliminating shortcut solutions in 3D representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised 3D reconstruction with explicit geometry
Fine-grained curriculum learning for unsupervised scalability
Direct 3D-aware representation learning from unlabeled images
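The curriculum idea in the list above (organizing training from easy to hard samples without labels) can be sketched as a staged data-release schedule driven by an unsupervised difficulty proxy, e.g. an early reconstruction loss or a view-overlap score. The paper's actual fine-grained strategy is not detailed here; the function below is a simplified, hypothetical illustration, and the proxy scores are assumed inputs.

```python
import numpy as np

def curriculum_schedule(difficulty, num_stages):
    """Hypothetical easy-to-hard curriculum sketch: rank samples by an
    unsupervised difficulty proxy (lower = easier) and release them in
    stages. Stage k trains on the easiest k/num_stages fraction of the
    data, so later stages progressively mix in harder samples."""
    order = np.argsort(difficulty)  # easiest samples first
    n = len(difficulty)
    stages = []
    for k in range(1, num_stages + 1):
        cutoff = int(np.ceil(n * k / num_stages))
        stages.append(order[:cutoff])  # indices visible at stage k
    return stages
```

Because the difficulty proxy itself requires no annotations, the schedule stays fully unsupervised; scoring each data source with the same proxy also gives one way to mix heterogeneous datasets on a common scale.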