🤖 AI Summary
This work addresses large-baseline novel view synthesis from only two input images, where existing methods struggle to reconstruct occluded regions and often deviate from the prescribed camera trajectory. To overcome these limitations, the authors propose ConfCtrl, a framework that uses a confidence-aware interpolation mechanism to guide a diffusion model to follow the target camera poses while generating the missing content. The approach integrates confidence-weighted point cloud projections with a Kalman-like predict-update strategy that balances geometric observations against pose-driven predictions. It also initializes the latent from the projection combined with noise and applies learned residual corrections, which improves geometric consistency and generation stability. Experiments on multiple datasets show visually plausible and geometrically coherent large-baseline view synthesis, including reconstruction of occluded regions.
📝 Abstract
We study novel view synthesis from only two input images under large viewpoint changes. Existing regression-based methods lack the capacity to reconstruct unseen regions, while camera-guided diffusion models often deviate from intended trajectories due to noisy point cloud projections or insufficient conditioning from camera poses. To address these issues, we propose ConfCtrl, a confidence-aware video interpolation framework that enables diffusion models to follow prescribed camera poses while completing unseen regions. ConfCtrl initializes the diffusion process by combining a confidence-weighted projected point cloud latent with noise as the conditioning input. It then applies a Kalman-inspired predict-update mechanism, treating the projected point cloud as a noisy measurement and using learned residual corrections to balance pose-driven predictions with noisy geometric observations. This allows the model to trust high-confidence projections while down-weighting uncertain regions, yielding stable, geometry-aware generation. Experiments on multiple datasets show that ConfCtrl produces geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions under large viewpoint changes.
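The two ideas named in the abstract, confidence-weighted initialization and a Kalman-style predict-update step, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `init_latent` and `predict_update` are hypothetical, the latents are toy arrays, and the gain is taken directly from the confidence map, whereas the paper describes learned residual corrections that are not modeled here.

```python
import numpy as np


def init_latent(projected, confidence, rng):
    """Mix a projected point cloud latent with Gaussian noise.

    Assumption for illustration: where confidence is 1 the projection is
    kept as-is, and where it is 0 the latent starts from pure noise.
    """
    c = np.clip(confidence, 0.0, 1.0)
    noise = rng.standard_normal(projected.shape)
    return c * projected + (1.0 - c) * noise


def predict_update(prediction, measurement, confidence):
    """Kalman-style update: x = x_pred + K * (z - x_pred).

    Assumption: the gain K is the confidence map itself. The paper's
    correction is a learned residual, which this sketch does not model.
    """
    gain = np.clip(confidence, 0.0, 1.0)
    return prediction + gain * (measurement - prediction)


rng = np.random.default_rng(0)
projected = np.ones((2, 2))                       # toy "measurement" latent
confidence = np.array([[1.0, 0.0], [0.5, 1.0]])   # toy per-pixel confidence
latent = init_latent(projected, confidence, rng)

prediction = np.zeros((2, 2))                     # toy pose-driven prediction
fused = predict_update(prediction, projected, confidence)
```

With these toy inputs, fully trusted pixels take the measurement, untrusted pixels keep the prediction, and a confidence of 0.5 lands halfway between the two. That is the behavior the abstract describes as trusting reliable projections while down-weighting uncertain regions.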