RemixFusion: Residual-based Mixed Representation for Large-scale Online RGB-D Reconstruction

📅 2025-07-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing neural implicit representations for large-scale online RGB-D reconstruction suffer from insufficient geometric detail, prolonged training time, and poor convergence in pose optimization. To address these issues, we propose a residual hybrid representation: a TSDF grid serves as the explicit geometric foundation, while a lightweight neural residual module captures high-frequency geometric details. We introduce a local moving volumetric partitioning scheme coupled with a divide-and-conquer online learning mechanism to enable efficient incremental updates. Instead of optimizing absolute camera poses, we jointly optimize inter-frame pose deltas and incorporate an adaptive gradient amplification strategy to accelerate convergence and improve global consistency. Experiments demonstrate that our method significantly outperforms current state-of-the-art approaches on large-scale scenes, achieving a superior trade-off among reconstruction accuracy, geometric fidelity, and real-time performance.

📝 Abstract
The introduction of neural implicit representations has notably propelled the advancement of online dense reconstruction techniques. Compared to traditional explicit representations such as TSDF, they improve mapping completeness and memory efficiency. However, the lack of reconstruction detail and the time-consuming learning of neural representations hinder the widespread application of neural-based methods to large-scale online reconstruction. We introduce RemixFusion, a novel residual-based mixed representation for scene reconstruction and camera pose estimation dedicated to high-quality and large-scale online RGB-D reconstruction. In particular, we propose a residual-based map representation composed of an explicit coarse TSDF grid and an implicit neural module that produces residuals representing fine-grained details to be added to the coarse grid. Such a mixed representation allows for detail-rich reconstruction within a bounded time and memory budget, in contrast with the overly smoothed results produced by purely implicit representations, thus paving the way for high-quality camera tracking. Furthermore, we extend the residual-based representation to handle multi-frame joint pose optimization via bundle adjustment (BA). In contrast to existing methods, which optimize poses directly, we opt to optimize pose changes. Combined with a novel technique for adaptive gradient amplification, our method attains better optimization convergence and global optimality. In addition, we adopt a local moving volume to factorize the mixed scene representation with a divide-and-conquer design, facilitating efficient online learning in our residual-based framework. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches, whether based on explicit or implicit representations, in the accuracy of both mapping and tracking on large-scale scenes.
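The core idea of the mixed representation, a coarse explicit TSDF grid plus a neural module that adds fine-grained residuals, can be sketched as follows. This is a toy illustration only: the grid resolution, the tiny one-hidden-layer "residual network", and all names are hypothetical stand-ins, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Coarse explicit TSDF grid: 8^3 voxels over the unit cube (illustrative size).
RES = 8
tsdf = rng.uniform(-1.0, 1.0, size=(RES, RES, RES))

# Tiny stand-in for the neural residual module: one hidden layer on the query point.
W1 = rng.normal(0, 0.1, (3, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.1, (16, 1)); b2 = np.zeros(1)

def coarse_sdf(p):
    """Trilinear interpolation of the explicit TSDF grid at point p in [0, 1]^3."""
    x = p * (RES - 1)
    i0 = np.clip(np.floor(x).astype(int), 0, RES - 2)
    f = x - i0
    val = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((1 - f[0]) if dx == 0 else f[0]) \
                  * ((1 - f[1]) if dy == 0 else f[1]) \
                  * ((1 - f[2]) if dz == 0 else f[2])
                val += w * tsdf[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return val

def residual(p):
    """Neural residual capturing high-frequency detail (toy MLP)."""
    h = np.tanh(p @ W1 + b1)
    return float((h @ W2 + b2)[0])

def mixed_sdf(p):
    """Mixed representation: coarse explicit value plus learned residual."""
    return coarse_sdf(p) + residual(p)

p = np.array([0.3, 0.5, 0.7])
print(mixed_sdf(p))
```

In this factorization, only the small residual network needs gradient-based training online, while the bulk of the geometry lives in the cheap explicit grid, which is what bounds the time and memory budget.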
Problem

Research questions and friction points this paper is trying to address.

Insufficient geometric detail and slow learning in large-scale online RGB-D reconstruction
How to combine explicit and implicit representations for richer scene detail
Poor convergence and global consistency in camera pose optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Residual-based mixed TSDF and neural representation
Adaptive gradient amplification for pose optimization
Local moving volume for efficient online learning
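The pose-delta idea, optimizing a correction on top of an initial pose guess rather than the absolute pose, combined with amplification of vanishing gradients, can be illustrated with a toy translation-only alignment. This is a deliberately simplified sketch (the paper operates on full camera poses within bundle adjustment); the threshold and scaling scheme here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
src = rng.normal(size=(50, 3))
t_true = np.array([0.4, -0.2, 0.1])
dst = src + t_true                     # observed correspondences

t_init = np.array([0.3, -0.1, 0.0])    # initial pose guess (e.g. from tracking)
delta = np.zeros(3)                    # optimize the pose *change*, not the pose
lr = 0.1

for _ in range(200):
    err = (src + t_init + delta) - dst          # per-point alignment residuals
    grad = 2.0 * err.mean(axis=0)               # gradient of the mean squared error
    g = np.linalg.norm(grad)
    if 0 < g < 1e-2:                            # toy adaptive amplification:
        grad *= 1e-2 / g                        # rescale vanishing gradients up
    delta -= lr * grad

t_est = t_init + delta
print(np.round(t_est, 3))                       # close to t_true
```

Optimizing `delta` from zero keeps the variable small and well-scaled regardless of how large the absolute pose is, and the amplification step keeps updates from stalling once the raw gradient becomes tiny.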
Authors
Yuqing Lan (National University of Defense Technology; 3D Vision, Computer Graphics)
Chenyang Zhu (National University of Defense Technology, China)
Shuaifeng Zhi (Imperial College London)
Jiazhao Zhang (Peking University; Embodied AI, Navigation, 3D Vision)
Zhoufeng Wang (National University of Defense Technology, China)
Renjiao Yi (National University of Defense Technology; Computer Graphics, 3D Vision)
Yijie Wang (National University of Defense Technology, China)
Kai Xu (National University of Defense Technology, China)