🤖 AI Summary
Existing camera-controlled video diffusion models often suffer from geometric distortions and limited camera controllability in novel view synthesis. To address this, this work proposes GeoNVS, a geometry-grounded novel-view synthesizer built around the Gaussian Splat Feature Adapter (GS-Adapter), a plug-and-play module that lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to enforce geometric consistency during view generation. By injecting geometry in feature space rather than at the input level, the method avoids view-dependent color noise, requires no additional training, and is zero-shot compatible with diverse feed-forward geometry models. Evaluated across nine scenes and eighteen settings, GeoNVS achieves state-of-the-art performance, outperforming SEVA and CameraCtrl by 11.3% and 14.9%, respectively. It reduces translation error by up to a factor of two and Chamfer distance by as much as sevenfold, demonstrating significantly improved geometric fidelity and view coherence.
📝 Abstract
Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training and allows adaptation to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, with 11.3% and 14.9% improvements over SEVA and CameraCtrl, respectively, up to a 2x reduction in translation error, and up to a 7x reduction in Chamfer Distance.
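The lift–render–fuse pipeline the abstract describes can be sketched in simplified form. Everything below is an illustrative assumption, not the paper's implementation: the actual GS-Adapter operates on latent diffusion features with full 3D Gaussian splatting and a learned adaptive fusion, whereas this sketch back-projects per-pixel features with depth, soft-splats them into a target view, and blends with a fixed gate.

```python
import numpy as np

def lift_features(feats, depth, K_inv, cam_to_world):
    """Lift per-pixel features into 3D points (stand-in for Gaussian centers).
    feats: (H, W, C) feature map; depth: (H, W) per-pixel depth."""
    H, W, C = feats.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ K_inv.T                       # camera-space ray directions (z = 1)
    pts_cam = rays * depth.reshape(-1, 1)      # back-project along rays with depth
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]
    return pts_world, feats.reshape(-1, C)

def splat_to_view(pts_world, pt_feats, K, world_to_cam, H, W, sigma=1.0):
    """Render geometry-constrained features for a novel view by projecting
    the lifted points and soft-splatting with a Gaussian pixel weight."""
    C = pt_feats.shape[1]
    pts_h = np.concatenate([pts_world, np.ones((len(pts_world), 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    vis = pts_cam[:, 2] > 1e-6                 # keep points in front of the camera
    proj = pts_cam[vis] @ K.T
    uv = proj[:, :2] / proj[:, 2:3]            # perspective divide -> pixel coords
    out = np.zeros((H, W, C))
    wsum = np.zeros((H, W, 1))
    for (u, v), feat in zip(uv, pt_feats[vis]):
        iu, iv = int(round(u)), int(round(v))
        if 0 <= iu < W and 0 <= iv < H:
            w = np.exp(-((u - iu) ** 2 + (v - iv) ** 2) / (2 * sigma ** 2))
            out[iv, iu] += w * feat
            wsum[iv, iu] += w
    return out / np.maximum(wsum, 1e-8), wsum > 0

def adaptive_fuse(diff_feats, geo_feats, mask, gate=0.5):
    """Blend rendered geometry features into diffusion features where the
    geometry is visible; `gate` stands in for the learned adaptive weight."""
    return np.where(mask, (1 - gate) * diff_feats + gate * geo_feats, diff_feats)
```

With identical source and target cameras, each lifted point projects back to its own pixel, so the rendered features round-trip exactly; in the general case the splat produces a geometry-consistent re-projection of the input-view features into the novel view, and the fusion step corrects the diffusion features only where geometry is available.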