GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis

2025-05-26
AI Summary
Existing neural radiance fields (NeRFs) suffer severe degradation in generalization and reconstruction quality under few-view settings (1–3 input views). To address this, we propose GoLF-NRT, a Global and Local feature Fusion-based Neural Rendering Transformer built on a global-local dual-path feature fusion mechanism: the global path models scene-level semantic context, while the local path encodes epipolar geometric constraints. GoLF-NRT further introduces 3D sparse attention and an adaptive ray sampling strategy guided by attention weights and kernel regression to improve sampling efficiency and geometric fidelity. Crucially, GoLF-NRT requires no scene-specific priors or fine-tuning; a small number of input views suffices for high-fidelity novel view synthesis. Extensive experiments demonstrate that GoLF-NRT significantly outperforms state-of-the-art methods across multiple benchmarks, achieving superior performance in PSNR, SSIM, and depth error metrics. Notably, under the most challenging 1–2-view configurations, GoLF-NRT excels in geometric consistency and fine-grained texture recovery.

Abstract
Neural Radiance Fields (NeRF) have transformed novel view synthesis by modeling scene-specific volumetric representations directly from images. While generalizable NeRF models can generate novel views across unknown scenes by learning latent ray representations, their performance heavily depends on a large number of multi-view observations. However, with limited input views, these methods experience significant degradation in rendering quality. To address this limitation, we propose GoLF-NRT: a Global and Local feature Fusion-based Neural Rendering Transformer. GoLF-NRT enhances generalizable neural rendering from few input views by leveraging a 3D transformer with efficient sparse attention to capture global scene context. In parallel, it integrates local geometric features extracted along the epipolar line, enabling high-quality scene reconstruction from as few as 1 to 3 input views. Furthermore, we introduce an adaptive sampling strategy based on attention weights and kernel regression, improving the accuracy of transformer-based neural rendering. Extensive experiments on public datasets show that GoLF-NRT achieves state-of-the-art performance across varying numbers of input views, highlighting the effectiveness and superiority of our approach. Code is available at https://github.com/KLMAV-CUC/GoLF-NRT.
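To make the phrase "local geometric features extracted along the epipolar line" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple pinhole camera with a world-to-camera pose `(R, t)` and intrinsics `(fx, fy, cx, cy)`; the function names `project` and `epipolar_samples` and all numeric values are hypothetical. Points sampled at several depths along one target ray are projected into a source view, and the resulting pixels fall on a single line, which is where per-view features would be looked up.

```python
def project(point, K, R, t):
    """Project a 3D world point into a pinhole camera (world-to-camera R, t)."""
    # Camera coordinates: x_cam = R @ point + t
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective divide, then apply intrinsics (fx, fy, cx, cy)
    fx, fy, cx, cy = K
    return (fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy)


def epipolar_samples(origin, direction, depths, K, R, t):
    """Source-view pixels hit by points sampled along one target ray."""
    pixels = []
    for d in depths:
        point = [origin[i] + d * direction[i] for i in range(3)]
        pixels.append(project(point, K, R, t))
    return pixels


# Hypothetical setup: the target ray starts at the world origin and points
# along +z; the source camera sits 1 unit along +x, so t = -R @ C = (-1, 0, 0).
K = (100.0, 100.0, 50.0, 50.0)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [-1.0, 0.0, 0.0]
pixels = epipolar_samples([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 2.0, 4.0], K, R, t)
print(pixels)  # every pixel shares v = 50: the samples trace one epipolar line
```

In this toy configuration the epipolar line is horizontal because the two cameras differ only by a sideways shift; a general pose would give a slanted line.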
Problem

Research questions and friction points this paper is trying to address.

Improves few-shot view synthesis with global-local feature fusion
Enhances neural rendering from 1-3 views using 3D transformer
Introduces adaptive sampling for accurate transformer-based rendering
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D transformer with sparse attention
Local geometric feature integration
Adaptive sampling with attention weights
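The adaptive sampling idea can be illustrated with a short sketch. This is an assumption-laden toy and not the paper's formulation: it smooths per-sample attention weights along a ray with Nadaraya-Watson kernel regression (Gaussian kernel), then places new samples by inverting the cumulative distribution of the smoothed curve. The names `kernel_regress` and `resample`, the bandwidth, the grid size, and the example weights are all hypothetical.

```python
import math


def kernel_regress(query, depths, weights, bandwidth):
    """Nadaraya-Watson estimate of the attention weight at depth `query`."""
    k = [math.exp(-0.5 * ((query - d) / bandwidth) ** 2) for d in depths]
    return sum(ki * wi for ki, wi in zip(k, weights)) / sum(k)


def resample(depths, weights, n_new, bandwidth=0.5, grid_size=200):
    """Place n_new depths where the smoothed attention is high (inverse CDF)."""
    near, far = depths[0], depths[-1]
    step = (far - near) / (grid_size - 1)
    grid = [near + i * step for i in range(grid_size)]
    density = [kernel_regress(g, depths, weights, bandwidth) for g in grid]
    total = sum(density)
    # Discrete cumulative distribution over the dense grid
    cdf, acc = [], 0.0
    for v in density:
        acc += v / total
        cdf.append(acc)
    new_depths, j = [], 0
    for i in range(n_new):
        u = (i + 0.5) / n_new  # deterministic, evenly spaced quantiles
        while j < grid_size - 1 and cdf[j] < u:
            j += 1
        new_depths.append(grid[j])
    return new_depths


# Hypothetical coarse pass: attention peaks at depth 3.0
coarse_depths = [1.0, 2.0, 3.0, 4.0, 5.0]
attention = [0.02, 0.05, 0.80, 0.10, 0.03]
fine_depths = resample(coarse_depths, attention, n_new=8)
print(fine_depths)  # samples cluster around the attention peak
```

Deterministic quantiles are used here only to keep the output reproducible; a training pipeline would more plausibly draw the quantiles at random within each stratum.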
You Wang
Key Laboratory of Media Audio and Video (Communication University of China), Ministry of Education, Beijing 100024, China
Li Fang
Key Laboratory of Media Audio and Video (Communication University of China), Ministry of Education, Beijing 100024, China
Hao Zhu
School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China
Fei Hu
Key Laboratory of Media Audio and Video (Communication University of China), Ministry of Education, Beijing 100024, China
Long Ye
Communication University of China
Multimedia Signal Processing, Artificial Intelligence
Zhan Ma
Vision Lab, Nanjing University
Learning for Video Coding & Communication, Computational Imaging