🤖 AI Summary
Existing object pose estimation methods often produce geometrically infeasible hypotheses, such as poses that interpenetrate a robotic gripper or detach from their supporting surface. This work proposes an end-to-end multimodal optimization framework that, for the first time, integrates differentiable physics simulation, differentiable rendering, and visuo-tactile perception to jointly enforce data consistency and physical plausibility. Experimental results demonstrate that the proposed approach reduces intersection volume error by 73% when the initial pose estimate is reasonably accurate, and by over 87% under high-uncertainty conditions, while also significantly decreasing both translational and rotational errors.
📝 Abstract
State-of-the-art object pose estimation methods are prone to generating geometrically infeasible pose hypotheses. This problem is prevalent in dexterous manipulation, where estimated poses often intersect with the robotic hand or do not rest on a support surface. We propose a multi-modal pose refinement approach that combines differentiable physics simulation, differentiable rendering, and visuo-tactile sensing to optimize object poses for both spatial accuracy and physical consistency. Simulated experiments show that our approach reduces the intersection volume error between the object and the robotic hand by 73% when the initial estimate is accurate, and by over 87% under high initial uncertainty, significantly outperforming standard ICP-based baselines. Furthermore, the improvement in geometric plausibility is accompanied by a reduction in translation and orientation errors. Achieving pose estimation that is grounded in physical reality while remaining faithful to multi-modal sensor inputs is a critical step toward robust in-hand manipulation.
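To make the joint-optimization idea concrete, the following is a minimal sketch of how such a refinement loop might look: a gradient-based optimizer adjusts a 6-DoF pose to minimize a weighted sum of a data-consistency term and a physical-plausibility penalty. Every component here is an illustrative stand-in rather than the authors' method: the point-cloud object model, the one-sided chamfer data term (in place of differentiable rendering and tactile losses), and the spherical finger proxy for the penetration penalty (in place of a differentiable physics simulator) are all assumptions made for the sketch.

```python
# Hypothetical sketch: refine a 6-DoF pose (3 translation + 3 axis-angle
# params) so the object model matches observed points (data term) while
# staying outside a gripper-finger proxy (physical-plausibility term).
import torch

def se3_apply(pose: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """Apply a pose (translation + axis-angle rotation) to Nx3 points."""
    t, w = pose[:3], pose[3:]
    theta = torch.sqrt((w * w).sum() + 1e-12)  # safe norm, differentiable at 0
    k = w / theta
    zero = torch.zeros((), dtype=pose.dtype)
    K = torch.stack([                           # skew-symmetric cross matrix
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])
    # Rodrigues' formula for the rotation matrix.
    R = torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)
    return points @ R.T + t

def data_loss(model_pts, observed_pts):
    """One-sided chamfer distance: each model point to its nearest observation."""
    return torch.cdist(model_pts, observed_pts).min(dim=1).values.mean()

def penetration_loss(model_pts, center, radius):
    """Quadratic penalty for model points inside a spherical finger proxy."""
    dist = (model_pts - center).norm(dim=1)
    return torch.relu(radius - dist).pow(2).mean()

# Toy data: a small object point cloud observed under a translation offset.
torch.manual_seed(0)
model_points = 0.05 * torch.randn(200, 3)
observed_points = model_points + torch.tensor([0.02, 0.0, 0.01])
finger_center, finger_radius = torch.tensor([0.0, 0.0, -0.06]), 0.02

pose = torch.zeros(6, requires_grad=True)       # initial pose estimate
opt = torch.optim.Adam([pose], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    pts = se3_apply(pose, model_points)
    loss = data_loss(pts, observed_points) \
         + 10.0 * penetration_loss(pts, finger_center, finger_radius)
    loss.backward()
    opt.step()
```

The weighting between the two terms (10.0 above) is an arbitrary choice for the sketch; in the paper the data term would come from differentiable rendering and visuo-tactile measurements and the plausibility term from differentiable physics simulation, but the joint gradient-based structure mirrors what the abstract describes.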