🤖 AI Summary
Photometric inconsistencies across multi-view images—introduced by on-device camera pipeline operations (e.g., exposure adjustment, white balance)—degrade novel view synthesis quality. Existing approaches that jointly optimize the scene representation and per-image appearance embeddings suffer from high computational cost and poor generalization. This paper proposes a Transformer-based bilateral grid prediction method, the first to incorporate Transformers into spatially adaptive bilateral grid modeling. The approach enables zero-shot, cross-scene photometric consistency correction without retraining and integrates seamlessly into the 3D Gaussian Splatting framework, preserving high-fidelity reconstruction while significantly improving training efficiency. Quantitative and qualitative evaluations across multiple datasets demonstrate reconstruction fidelity on par with or superior to state-of-the-art scene-specific optimization methods, with notably faster convergence.
📝 Abstract
Modern camera pipelines apply extensive on-device processing, such as exposure adjustment, white balance, and color correction, which, while beneficial individually, often introduces photometric inconsistencies across views. These appearance variations violate multi-view consistency and degrade the quality of novel view synthesis. Joint optimization of scene representations and per-image appearance embeddings has been proposed to address this issue, but at the cost of increased computational complexity and slower training. In this work, we propose a Transformer-based method that predicts spatially adaptive bilateral grids to correct photometric variations in a multi-view consistent manner, enabling robust cross-scene generalization without scene-specific retraining. By incorporating the learned grids into the 3D Gaussian Splatting pipeline, we improve reconstruction quality while maintaining high training efficiency. Extensive experiments show that our approach matches or outperforms existing scene-specific optimization methods in reconstruction fidelity and convergence speed.
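The abstract does not spell out how a predicted bilateral grid is applied to an image. As an illustration only (the paper's exact parameterization may differ), a common formulation stores a per-cell 3×4 affine color transform in a low-resolution grid indexed by spatial position and a per-pixel guidance value, then "slices" the grid—trilinearly interpolating coefficients per pixel—to produce a spatially adaptive photometric correction. All array shapes and the luminance guide below are assumptions for the sketch:

```python
import numpy as np

def slice_bilateral_grid(grid, image, guide):
    """Apply a bilateral grid of affine color transforms to an image.

    grid:  (D, Gh, Gw, 3, 4) array of per-cell 3x4 affine color matrices,
           indexed by (guidance bin, grid row, grid col). Shapes are
           illustrative assumptions, not the paper's configuration.
    image: (H, W, 3) float array in [0, 1].
    guide: (H, W) float array in [0, 1], e.g. per-pixel luminance.
    """
    D, Gh, Gw = grid.shape[:3]
    H, W = guide.shape
    # Continuous grid coordinates for every pixel.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    gz = guide * (D - 1)
    gy = ys / max(H - 1, 1) * (Gh - 1)
    gx = xs / max(W - 1, 1) * (Gw - 1)
    # Trilinear interpolation ("slicing") of the affine coefficients.
    z0, y0, x0 = (np.floor(v).astype(int) for v in (gz, gy, gx))
    z1 = np.minimum(z0 + 1, D - 1)
    y1 = np.minimum(y0 + 1, Gh - 1)
    x1 = np.minimum(x0 + 1, Gw - 1)
    wz, wy, wx = gz - z0, gy - y0, gx - x0
    A = np.zeros((H, W, 3, 4))
    for dz, dy, dx in np.ndindex(2, 2, 2):
        zi = z1 if dz else z0
        yi = y1 if dy else y0
        xi = x1 if dx else x0
        w = ((wz if dz else 1 - wz)
             * (wy if dy else 1 - wy)
             * (wx if dx else 1 - wx))
        A += w[..., None, None] * grid[zi, yi, xi]
    # Per-pixel affine transform: out = M @ [r, g, b, 1].
    rgb1 = np.concatenate([image, np.ones((H, W, 1))], axis=-1)
    return np.einsum("hwij,hwj->hwi", A, rgb1)
```

In this formulation, the Transformer's role would be to predict `grid` for each view so that the sliced corrections cancel per-image exposure and white-balance differences before the rendering loss is computed; because the grid is low-resolution and the transform per pixel is affine, the correction is smooth and cheap to apply inside the 3DGS training loop.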