🤖 AI Summary
This work addresses camera pose estimation from sparse views, particularly two-view settings, where non-convex loss landscapes, geometric symmetries, and self-similarities often trap optimization in local minima. To mitigate erroneous convergence caused by these geometric ambiguities, the authors propose a landscape-aware, score-based optimization method that reshapes the loss landscape to guide updates toward the ground-truth viewpoint. The resulting estimate is then refined using a viewpoint-conditioned diffusion model such as Zero123. According to the authors' experiments, the method improves convergence and sample efficiency, achieving competitive accuracy under sparse-view conditions while reducing reliance on brute-force multi-start sampling.
📝 Abstract
Accurate camera viewpoint estimation under sparse-view conditions remains challenging, particularly in two-view scenarios. Recent approaches leverage diffusion models such as Zero123 to synthesize novel views conditioned on relative viewpoint, and show promising results when repurposed for viewpoint estimation via optimization with an MSE loss. However, existing methods often suffer from a nonconvex loss landscape with numerous local minima, making them sensitive to initialization and reliant on naive multistart strategies. We analyze these optimization challenges and visualize failure cases, showing that geometric ambiguities, such as symmetry and self-similarity, can mislead gradient-based updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization landscape to guide updates toward the ground-truth viewpoint, followed by a refinement stage using a viewpoint-conditioned diffusion model. Experiments show that our method improves convergence, reduces reliance on brute-force sampling, and achieves competitive accuracy with higher sample efficiency.
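The initialization sensitivity described in the abstract can be illustrated with a toy example. The sketch below is not the authors' method and does not use Zero123 or any rendered images. It assumes a hypothetical one-dimensional viewpoint (an azimuth angle in radians) and a hypothetical closed-form loss in which a dominant 180-degree-symmetric term and a weak asymmetric term stand in for an MSE loss on a nearly symmetric object. All names and constants (`A`, `B`, `loss`, `grad`, `descend`, `multistart`) are assumptions made for illustration.

```python
import math

# Hypothetical weights: A scales the weak asymmetric cue, B the dominant
# 180-degree-symmetric cue. Both values are illustrative assumptions.
A, B = 0.3, 1.0


def loss(theta, target):
    """Toy stand-in for an MSE viewpoint loss on a nearly symmetric object."""
    d = theta - target
    return 0.5 * (A**2 * (1 - math.cos(d)) + B**2 * (1 - math.cos(2 * d)))


def grad(theta, target):
    """Analytic derivative of the toy loss with respect to theta."""
    d = theta - target
    return 0.5 * (A**2 * math.sin(d) + 2 * B**2 * math.sin(2 * d))


def descend(theta0, target, lr=0.1, steps=500):
    """Plain gradient descent from a single initial angle."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta, target)
    return theta


def multistart(target, n_starts=8):
    """Naive multistart: descend from evenly spaced angles, keep the best."""
    starts = [2 * math.pi * k / n_starts for k in range(n_starts)]
    finals = [descend(s, target) for s in starts]
    return min(finals, key=lambda t: loss(t, target))


if __name__ == "__main__":
    target = 1.0
    for start in (1.5, 4.0):
        end = descend(start, target)
        print(f"start={start:.2f} -> theta={end:.4f}, loss={loss(end, target):.4f}")
    print(f"multistart -> theta={multistart(target):.4f}")
```

In this toy landscape, a start near the true angle converges to it, while a start near the opposite side of the object settles at the symmetric counterpart (the target plus π) with a small but nonzero loss. Multistart recovers the true angle only by paying for several independent descents, which is the cost that reshaping the landscape aims to reduce.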