SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images

📅 2025-10-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In long video sequences, generalizable 3D Gaussian Splatting (3DGS) suffers from severe redundancy and geometric/photometric inconsistency, because existing methods predict per-pixel Gaussians and stack the Gaussians from all views. This paper proposes SaLon3R, a structure-aware, online, generalizable 3D reconstruction framework. It introduces compact anchor primitives and differentiable saliency-aware Gaussian quantization to suppress redundancy (removing 50% to 90% of Gaussians), uses a 3D Point Transformer to refine anchor attributes and saliency so that cross-frame geometric and photometric inconsistencies are resolved, and applies regionally adaptive Gaussian decoding in a single feed-forward pass. According to the authors, it is the first online generalizable GS method that reconstructs over 50 views at over 10 FPS, without known camera parameters or test-time optimization. Experiments on multiple datasets report state-of-the-art novel view synthesis and depth estimation, along with gains in efficiency, robustness, and generalization to long sequences.

📝 Abstract

Recent advances in 3D Gaussian Splatting (3DGS) have enabled generalizable, on-the-fly reconstruction of sequential input views. However, existing methods often predict per-pixel Gaussians and combine Gaussians from all views as the scene representation, leading to substantial redundancies and geometric inconsistencies in long-duration video sequences. To address this, we propose SaLon3R, a novel framework for Structure-aware, Long-term 3DGS Reconstruction. To the best of our knowledge, SaLon3R is the first online generalizable GS method capable of reconstructing over 50 views at over 10 FPS, with 50% to 90% redundancy removal. Our method introduces compact anchor primitives to eliminate redundancy through differentiable saliency-aware Gaussian quantization, coupled with a 3D Point Transformer that refines anchor attributes and saliency to resolve cross-frame geometric and photometric inconsistencies. Specifically, we first leverage a 3D reconstruction backbone to predict dense per-pixel Gaussians and a saliency map encoding regional geometric complexity. Redundant Gaussians are compressed into compact anchors by prioritizing high-complexity regions. The 3D Point Transformer then learns spatial structural priors in 3D space from training data to refine anchor attributes and saliency, enabling regionally adaptive Gaussian decoding for geometric fidelity. Without known camera parameters or test-time optimization, our approach effectively resolves artifacts and prunes redundant 3D Gaussians in a single feed-forward pass. Experiments on multiple datasets demonstrate state-of-the-art performance on both novel view synthesis and depth estimation, with superior efficiency, robustness, and generalization ability for long-term generalizable 3D reconstruction. Project Page: https://wrld.github.io/SaLon3R/.
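
The idea of compressing dense per-pixel Gaussians into compact anchors can be illustrated with a small sketch. This is not the authors' implementation: it assumes a uniform voxel grid and a plain saliency-weighted average, and it is written in NumPy, so it is not differentiable. The function name `quantize_to_anchors` and the parameter `base_voxel` are hypothetical and do not come from the paper.

```python
import numpy as np


def quantize_to_anchors(positions, features, saliency, base_voxel=0.2):
    """Merge Gaussians that fall in the same voxel into a single anchor.

    Hypothetical illustration only: uniform voxels and a saliency-weighted
    mean are assumptions, not the quantization scheme from the paper.
    positions: (N, 3) Gaussian centres, features: (N, C), saliency: (N,)
    """
    # Integer voxel coordinates of every Gaussian centre.
    keys = np.floor(positions / base_voxel).astype(np.int64)

    # Map each Gaussian to the index of its (unique) occupied voxel.
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    n_anchors = int(inverse.max()) + 1

    # High-saliency Gaussians contribute more to their anchor.
    weights = saliency + 1e-8
    weight_sum = np.zeros(n_anchors)
    np.add.at(weight_sum, inverse, weights)

    anchor_pos = np.zeros((n_anchors, 3))
    np.add.at(anchor_pos, inverse, positions * weights[:, None])
    anchor_pos /= weight_sum[:, None]

    anchor_feat = np.zeros((n_anchors, features.shape[1]))
    np.add.at(anchor_feat, inverse, features * weights[:, None])
    anchor_feat /= weight_sum[:, None]

    return anchor_pos, anchor_feat
```

In this sketch, Gaussians from different views that land in the same voxel collapse into one anchor, which is the source of the redundancy removal; the refinement by a 3D Point Transformer and the regionally adaptive decoding described in the abstract are not modeled here.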

Problem

Research questions and friction points this paper is trying to address.

Reduces redundancy in 3D Gaussian Splatting for long video sequences

Resolves geometric inconsistencies in unposed multi-view reconstruction

Enables efficient real-time 3D reconstruction without camera parameters
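
A back-of-the-envelope calculation shows why per-pixel prediction becomes redundant on long sequences. The 256×256 resolution below is an assumed value for illustration and is not taken from the paper; the 50-view count and the 50% to 90% removal range are the figures quoted in the abstract. The helper names are hypothetical.

```python
def per_pixel_gaussian_count(num_views, height, width):
    """One Gaussian per pixel per view, all views stacked together."""
    return num_views * height * width


def count_after_removal(total, removal_ratio):
    """Gaussians remaining after removing a given fraction as redundant."""
    return int(round(total * (1.0 - removal_ratio)))


# Assumed resolution (256x256) for illustration only.
total = per_pixel_gaussian_count(num_views=50, height=256, width=256)
low = count_after_removal(total, 0.5)
high = count_after_removal(total, 0.9)
print(total, low, high)
```

Under these assumptions, the stacked representation grows linearly with the number of views, while removing 50% to 90% of the Gaussians leaves between half and a tenth of that total.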

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses compact anchor primitives to remove redundancy

Employs a 3D Point Transformer to refine anchor attributes and saliency

Performs feed-forward reconstruction without camera parameters

🔎 Similar Papers

Generalizable 3D Scene Reconstruction via Divide and Conquer from a Single View