🤖 AI Summary
To address the challenge of simultaneously achieving fine-grained geometric reconstruction (e.g., of text) and real-time rendering in neural scene reconstruction from high-resolution multi-view inputs (up to 64 views at 950×540), this work proposes a semi-explicit scene representation coupled with a lightweight decoder, bypassing the per-frame depth-decompression bottleneck inherent in conventional implicit methods. The approach integrates an enhanced Gaussian Splatting representation, joint RGB-depth supervision, and long-sequence modeling over wide-baseline inputs. It achieves both fine-grained geometry reconstruction and real-time rendering at 14 FPS on an A100 GPU under 32-64-view settings, matching the rendering quality of LaCT on DL3DV while delivering more accurate novel-view depth prediction on ScanNetv2 than direct depth rendering from Gaussians, as in Long-LRM and other explicit Gaussian Splatting baselines.
📝 Abstract
Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full Transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.
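The speed argument above can be made concrete with a back-of-the-envelope cost model. The sketch below is purely illustrative and uses made-up layer counts, token counts, and widths (none of these numbers come from the paper): it compares the per-frame compute of decoding each rendered view with a full Transformer backbone, as in LVSM/LaCT-style implicit methods, against decoding with a shallow, narrow head over a precomputed semi-explicit representation.

```python
def frame_cost(tokens: int, layers: int, dim: int) -> int:
    """Rough per-frame FLOP proxy for a Transformer decode pass.

    Per layer: self-attention scales ~ tokens^2 * dim,
    and the MLP scales ~ tokens * dim^2. Constants are ignored,
    so only the ratio between configurations is meaningful.
    """
    return layers * (tokens**2 * dim + tokens * dim**2)

# Hypothetical full implicit backbone: every novel view re-runs all layers.
full_backbone = frame_cost(tokens=2000, layers=24, dim=1024)

# Hypothetical lightweight decoder over a semi-explicit scene representation:
# the expensive scene encoding is done once, so each frame only pays for
# a few shallow layers at reduced width.
light_decoder = frame_cost(tokens=2000, layers=2, dim=256)

print(f"per-frame cost ratio: {full_backbone / light_decoder:.0f}x")
```

Under these toy settings the full-backbone decode is roughly 60x more expensive per frame, which is the regime where per-frame "decompression" rules out real-time rendering while a lightweight decoder does not.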