AI Summary
This work addresses the inefficiency and limited scalability of joint geometric and semantic optimization in indoor neural surface reconstruction. To overcome these challenges, the authors propose FSTM, a two-stage optimization framework that first leverages RGB and geometric cues to efficiently recover geometry via a signed distance function (SDF), and subsequently estimates the semantic field independently. This decoupled strategy avoids the computational overhead of complex multi-SDF architectures and end-to-end joint training. Without requiring specialized modules, FSTM achieves a 2.3× speedup in training on the Replica dataset, demonstrates greater robustness to real-world scene noise on ScanNet++, and recovers more object surfaces with higher recall, significantly improving both reconstruction efficiency and quality.
Abstract
Neural Surface Reconstruction has become a standard methodology for indoor 3D reconstruction, with Signed Distance Functions (SDFs) proving particularly effective for representing scene geometry. A variety of applications require a detailed understanding of the scene context, driving the need for object-level semantic signals. While recent methods successfully integrate semantic labels, they often inherit the slow training time and limited scalability of multi-SDF learning. In this paper, we introduce FSTM, a unified approach for learning geometry and semantics through a two-step process: a geometry warm-up using RGB inputs and geometric cues, followed by semantic field estimation. By first optimising geometry without semantic supervision, we observe substantial improvements compared to the standard joint optimisation. Rather than relying on specialised modules or complex multi-SDF designs, FSTM shows that a streamlined formulation is sufficient to achieve strong geometric and semantic reconstructions. Experiments on both synthetic and real-world indoor datasets show that our method outperforms multi-SDF approaches. It trains 2.3x faster on Replica, improves robustness to real-world imperfections on ScanNet++, and achieves higher recall by recovering the surfaces of more objects in the scene. The code will be made available at https://remichierchia.github.io/FSTM.
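The two-step process described above (a geometry warm-up without semantic supervision, followed by semantic field estimation) can be illustrated as a simple loss-weight schedule. This is a minimal sketch and not the authors' implementation: the names `StageWeights`, `two_stage_weights`, `total_loss`, and `warmup_steps`, the unit loss weights, and the `freeze_geometry` switch are all hypothetical. In particular, the abstract does not say whether geometry keeps updating during the second step, so that behaviour is left as an explicit assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageWeights:
    """Loss weights active at one training step (hypothetical names)."""
    rgb: float
    geometry_cues: float
    semantics: float
    update_geometry: bool


def two_stage_weights(step: int, warmup_steps: int,
                      freeze_geometry: bool = True) -> StageWeights:
    """Return the loss weights for one step of a two-stage schedule.

    Stage 1 (step < warmup_steps): geometry warm-up using only the RGB
    and geometric-cue losses, with no semantic supervision.
    Stage 2 (step >= warmup_steps): the semantic loss is switched on.
    Whether geometry is frozen in stage 2 is an assumption of this
    sketch, controlled by `freeze_geometry`.
    """
    if step < 0 or warmup_steps < 0:
        raise ValueError("step and warmup_steps must be non-negative")
    if step < warmup_steps:
        return StageWeights(rgb=1.0, geometry_cues=1.0,
                            semantics=0.0, update_geometry=True)
    if freeze_geometry:
        return StageWeights(rgb=0.0, geometry_cues=0.0,
                            semantics=1.0, update_geometry=False)
    return StageWeights(rgb=1.0, geometry_cues=1.0,
                        semantics=1.0, update_geometry=True)


def total_loss(w: StageWeights, rgb_loss: float,
               cue_loss: float, sem_loss: float) -> float:
    """Combine the per-term losses using the weights of the current stage."""
    return (w.rgb * rgb_loss
            + w.geometry_cues * cue_loss
            + w.semantics * sem_loss)
```

In a training loop, `two_stage_weights(step, warmup_steps)` would be queried once per iteration, so the semantic term contributes nothing until the warm-up ends. This contrasts with joint optimisation, where all three terms are active from the first step.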