Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of simultaneously achieving fine-grained geometric reconstruction (e.g., legible text) and real-time rendering in neural scene reconstruction from high-resolution multi-view inputs (up to 64 views at 950×540), this work proposes a semi-explicit scene representation coupled with a lightweight decoder, bypassing the deep, sequential per-frame decompression bottleneck inherent in conventional implicit methods. The approach integrates an enhanced Gaussian Splatting representation, joint RGB-depth supervision, and wide-baseline sequence modeling. To the authors' knowledge, it is the first method to achieve both fine-grained geometry reconstruction and real-time rendering (14 FPS on an A100 GPU) under 32–64-view settings. Rendering quality matches that of LaCT on DL3DV, while novel-view depth prediction on ScanNetv2 sets a new state of the art, significantly outperforming Long-LRM and direct depth rendering from standard Gaussian Splatting baselines.

📝 Abstract
Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit-representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians and decoding RGB frames with a full transformer or TTT backbone. This computationally intensive decompression must run for every rendered frame, however, making real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at $950\times540$ resolution, demonstrating strong generalization to longer inputs. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component of the proposed framework.
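The abstract describes the semi-explicit idea only at a high level. As a rough illustration, the sketch below contrasts fully explicit Gaussians (which store final colors directly) with primitives that store latent features, decoded into pixel colors by a lightweight per-pixel MLP at render time. All names, shapes, and the tiny two-layer MLP are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical sketch of a "semi-explicit" scene representation:
# primitives keep explicit positions but store a latent feature vector
# instead of a final RGB color. A lightweight decoder (here, a tiny
# two-layer MLP; far cheaper than re-running a full transformer or TTT
# backbone per frame) maps splatted features to colors.
rng = np.random.default_rng(0)
N_PRIMS, FEAT_DIM, HIDDEN = 1000, 16, 32

positions = rng.normal(size=(N_PRIMS, 3))        # explicit geometry
features = rng.normal(size=(N_PRIMS, FEAT_DIM))  # latent appearance

# Decoder weights (in practice learned jointly with the features).
W1 = rng.normal(size=(FEAT_DIM, HIDDEN)) * 0.1
W2 = rng.normal(size=(HIDDEN, 3)) * 0.1

def decode(feat):
    """Map splatted latent features to RGB, one small MLP pass per pixel."""
    h = np.maximum(feat @ W1, 0.0)           # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2)))   # sigmoid -> colors in [0, 1]

# Pretend rasterization/splatting already accumulated per-pixel features.
H_IMG, W_IMG = 4, 4
splatted = rng.normal(size=(H_IMG, W_IMG, FEAT_DIM))
image = decode(splatted.reshape(-1, FEAT_DIM)).reshape(H_IMG, W_IMG, 3)
```

Because the expensive compression lives in the per-primitive features while decoding is a shallow per-pixel pass, rendering cost stays near that of explicit splatting, which is the speed/fidelity trade-off the abstract claims.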
Problem

Research questions and friction points this paper is trying to address.

Preserving fine details in feed-forward wide-coverage 3D scene reconstruction
Achieving real-time rendering while maintaining implicit representation quality
Scaling reconstruction to higher input view counts without sacrificing detail
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-explicit scene representation for detail preservation
Lightweight decoder enabling real-time rendering at 14 FPS
Scales to 64 input views with strong generalization
👥 Authors
Chen Ziwen (Adobe Research) · Hao Tan (Adobe Research) · Peng Wang (Tripo AI) · Zexiang Xu (Hillbot) · Li Fuxin (Oregon State University)