🤖 AI Summary
To address the severe degradation of NeRF reconstruction quality under sparse-view settings, this paper introduces a "simplification-as-regularization" paradigm: by actively constraining model capacity—e.g., pruning positional-encoding frequencies, reducing tensor ranks, or limiting hash-table size—it implicitly induces robust depth priors without external depth supervision. The framework is compatible with both implicit and explicit radiance fields, and it couples reduced-capacity augmented models with depth-consistency constraints, jointly optimized to self-supervise depth estimation. Evaluated on standard benchmarks, the method significantly improves novel-view synthesis quality under sparse inputs, achieving state-of-the-art geometric fidelity and rendering performance on both forward-facing and 360° scenes.
📝 Abstract
Neural Radiance Fields (NeRF) show impressive performance in photo-realistic free-view rendering of scenes. Recent improvements on NeRF, such as TensoRF and ZipNeRF, employ explicit models for faster optimization and rendering, as compared to the NeRF that employs an implicit representation. However, both implicit and explicit radiance fields require dense sampling of images in the given scene, and their performance degrades significantly when only a sparse set of views is available. Researchers have found that supervising the depth estimated by a radiance field helps train it effectively with fewer views. The depth supervision is obtained either using classical approaches or neural networks pre-trained on a large dataset. While the former may provide only sparse supervision, the latter may suffer from generalization issues. In contrast to these earlier approaches, we seek to learn the depth supervision by designing augmented models and training them along with the main radiance field. Further, we aim to design a framework of regularizations that can work across different implicit and explicit radiance fields. We observe that certain features of these radiance field models overfit to the observed images in the sparse-input scenario. Our key finding is that reducing the capability of the radiance fields with respect to positional encoding, the number of decomposed tensor components, or the size of the hash table constrains the model to learn simpler solutions, which estimate better depth in certain regions. By designing augmented models based on such reduced capabilities, we obtain better depth supervision for the main radiance field. Employing the above regularizations, we achieve state-of-the-art view-synthesis performance with sparse input views on popular datasets containing forward-facing and 360$^\circ$ scenes.
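To make the capacity-reduction idea concrete, here is a minimal sketch of the positional-encoding case. The encoding form follows the original NeRF convention (interleaved sines and cosines at dyadic frequencies); the specific frequency counts and the `positional_encoding` helper are illustrative, not the paper's actual implementation.

```python
import numpy as np

def positional_encoding(x, num_frequencies):
    """NeRF-style encoding: concatenate x with sin(2^k * pi * x) and
    cos(2^k * pi * x) for k = 0 .. num_frequencies - 1."""
    feats = [x]
    for k in range(num_frequencies):
        feats.append(np.sin((2.0 ** k) * np.pi * x))
        feats.append(np.cos((2.0 ** k) * np.pi * x))
    return np.concatenate(feats, axis=-1)

points = np.random.rand(4, 3)  # a batch of 3D sample points

# Main radiance field: full frequency budget (10, as in the original NeRF),
# giving a 3 + 3*2*10 = 63-dimensional input feature.
full = positional_encoding(points, num_frequencies=10)

# Augmented model: fewer frequencies limit how fast the network can vary
# spatially, biasing it toward simpler (smoother) depth solutions.
# Feature dimension drops to 3 + 3*2*2 = 15.
reduced = positional_encoding(points, num_frequencies=2)
```

The same principle applies to the explicit models: an augmented TensoRF would use fewer decomposed tensor components, and an augmented hash-grid model a smaller hash table, each trained jointly with the main field to supply depth supervision.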