🤖 AI Summary
This work addresses 3D geometric reconstruction of hand-held objects from monocular RGB images. It introduces a diffusion-based framework that conditions a latent diffusion model on an inpainted object appearance and uses hand-object interaction as a geometric prior within an optimization-in-the-loop design. During inference, guidance is applied as supervision on the velocity field while the hand and object transformations are jointly optimized. The optimization is driven by multi-modal geometric cues (normal and depth alignment, silhouette consistency, and 2D keypoint reprojection), complemented by signed distance field (SDF) supervision and by contact and non-intersection constraints that enforce physically plausible hand-object interaction. Because appearance completion and geometric optimization are combined inside the diffusion process, the method produces high-quality geometry without extensive post-processing and is more robust under occlusion. The reconstructions are reported to be accurate and coherent, to generalize to in-the-wild scenes, and to outperform existing state-of-the-art approaches.
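The contact and non-penetration terms mentioned in the summary can be illustrated with a small sketch. It is not the paper's implementation: it assumes an analytic sphere as a stand-in for the object's SDF, a hand represented by three sample points, and a hand-picked contact mask. The function names `sphere_sdf`, `penetration_loss`, and `contact_loss` are hypothetical.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius

def penetration_loss(sdf_values):
    """Mean penetration depth; only points inside the object (SDF < 0) contribute."""
    return float(np.mean(np.maximum(-sdf_values, 0.0)))

def contact_loss(sdf_values, contact_mask):
    """Mean absolute surface distance of the points expected to be in contact."""
    if not np.any(contact_mask):
        return 0.0
    return float(np.mean(np.abs(sdf_values[contact_mask])))

# Toy scene (assumed): unit-sphere "object" at the origin, three "hand" points.
center = np.zeros(3)
hand_points = np.array([
    [0.5, 0.0, 0.0],   # inside the object  -> penetrating
    [1.0, 0.0, 0.0],   # on the surface     -> in contact
    [2.0, 0.0, 0.0],   # outside the object -> free
])
contact_mask = np.array([False, True, False])  # assumed contact labels

sdf = sphere_sdf(hand_points, center, radius=1.0)   # [-0.5, 0.0, 1.0]
pen = penetration_loss(sdf)
con = contact_loss(sdf, contact_mask)
print("sdf:", sdf, "penetration:", pen, "contact:", con)
```

In this toy scene only the first point contributes to the penetration term, and the contact term vanishes because the labelled contact point lies exactly on the surface. A real system would evaluate the reconstructed object's SDF at hand mesh vertices instead of an analytic sphere.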
📝 Abstract
We propose a novel diffusion-based framework for reconstructing 3D geometry of hand-held objects from monocular RGB images by leveraging hand-object interaction as geometric guidance. Our method conditions a latent diffusion model on an inpainted object appearance and uses inference-time guidance to optimize the object reconstruction, while simultaneously ensuring plausible hand-object interactions. Unlike prior methods that rely on extensive post-processing or produce low-quality reconstructions, our approach directly generates high-quality object geometry during the diffusion process by introducing guidance with an optimization-in-the-loop design. Specifically, we guide the diffusion model by applying supervision to the velocity field while simultaneously optimizing the transformations of both the hand and the object being reconstructed. This optimization is driven by multi-modal geometric cues, including normal and depth alignment, silhouette consistency, and 2D keypoint reprojection. We further incorporate signed distance field supervision and enforce contact and non-intersection constraints to ensure physical plausibility of hand-object interaction. Our method yields accurate, robust and coherent reconstructions under occlusion while generalizing well to in-the-wild scenarios.
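The idea of guiding the velocity field while optimizing a transformation in the same loop can be sketched on a toy problem. Everything here is an illustrative assumption, not the authors' method: `toy_velocity` stands in for the learned velocity network, the "geometric cue" is a single quadratic energy against an observed point, the pose is reduced to a translation `offset`, and the sampler is plain Euler integration from t = 0 to t = 1.

```python
import numpy as np

def toy_velocity(x, prior_target):
    """Stand-in for a learned velocity network: pulls x toward a fixed target."""
    return prior_target - x

def guided_sample(x0, prior_target, observation, steps=50,
                  guidance_scale=4.0, pose_lr=0.1):
    """Euler sampler with energy guidance and an in-the-loop pose update.

    Energy (assumed): E(x, offset) = 0.5 * ||x + offset - observation||^2.
    Its gradient with respect to both x and offset is the residual.
    """
    x = x0.astype(float)                 # astype returns a copy
    offset = np.zeros_like(x)            # hypothetical pose parameter (translation)
    dt = 1.0 / steps
    for _ in range(steps):
        residual = x + offset - observation
        # Guidance: subtract the scaled energy gradient from the base velocity.
        v = toy_velocity(x, prior_target) - guidance_scale * residual
        x = x + dt * v
        # Optimization in the loop: one gradient step on the pose parameter.
        offset = offset - pose_lr * residual
    return x, offset

x0 = np.zeros(2)
prior_target = np.array([1.0, 0.0])      # where the unguided model would go
observation = np.array([1.0, 1.0])       # toy image-derived cue

x_plain, off_plain = guided_sample(x0, prior_target, observation,
                                   guidance_scale=0.0, pose_lr=0.0)
x_guided, off_guided = guided_sample(x0, prior_target, observation)

res_plain = np.linalg.norm(x_plain + off_plain - observation)
res_guided = np.linalg.norm(x_guided + off_guided - observation)
print("residual without guidance:", res_plain)
print("residual with guidance:   ", res_guided)
```

With these toy settings the guided run ends with a smaller residual against the observation than the unguided run. In the setting the abstract describes, the single quadratic energy would be replaced by the combined normal, depth, silhouette, keypoint, SDF, and contact terms, and the translation by full hand and object transformations.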