Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

📅 2026-04-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of 3D object reconstruction under severe hand occlusion by proposing a multimodal approach that fuses visual, proprioceptive, and multi-point tactile signals. Leveraging a structured variational autoencoder (Structure-VAE) and a conditional flow-matching diffusion model, the method generates camera-aligned, physically consistent signed distance fields in a compact latent space. These reconstructions are refined through differentiable rendering and physics-informed optimization to ensure metric-scale accuracy, penetration-free geometry, and precise contact alignment. This study is the first to integrate proprioception and multi-contact tactile feedback into generative 3D reconstruction under occlusion, demonstrating substantial improvements over vision-only baselines in simulation and successful transfer to a real humanoid robot with an end-effector unseen during training.
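
As a rough illustration of the generative core described above, the sketch below shows a standard conditional flow-matching training objective in a VAE latent space. All names here (`velocity_net`, `cond`, the latent `z1`) are hypothetical placeholders; the paper's actual architecture and conditioning details are not specified in this summary.

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_net: nn.Module,
                       z1: torch.Tensor,      # clean object latent from the Structure-VAE encoder
                       cond: torch.Tensor) -> torch.Tensor:
    """Conditional flow-matching loss on a linear noise-to-data path (sketch)."""
    z0 = torch.randn_like(z1)                          # noise sample at t = 0
    t = torch.rand(z1.shape[0], 1, device=z1.device)   # random time in [0, 1)
    zt = (1.0 - t) * z0 + t * z1                       # point on the interpolation path
    target_v = z1 - z0                                 # constant target velocity of this path
    pred_v = velocity_net(zt, t, cond)                 # model predicts the velocity field,
                                                       # conditioned on RGB, masks, hand latent, touch
    return ((pred_v - target_v) ** 2).mean()
```

In this reading, `cond` would fuse the visible RGB evidence, occluder/visibility masks, hand latent, and tactile features the summary lists; sampling would then integrate the learned velocity field from noise to a latent that the Structure-VAE decodes into an SDF.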

📝 Abstract
We propose a multimodal, physically grounded approach for metric-scale amodal object reconstruction and pose estimation under severe hand occlusion. Unlike prior occlusion-aware 3D generation methods that rely only on vision, we leverage physical interaction signals: proprioception provides the posed hand geometry, and multi-contact touch constrains where the object surface must lie, reducing ambiguity in occluded regions. We represent object structure as a pose-aware, camera-aligned signed distance field (SDF) and learn a compact latent space with a Structure-VAE. In this latent space, we train a conditional flow-matching diffusion model, pretraining on vision-only images and finetuning on occluded manipulation scenes while conditioning on visible RGB evidence, occluder/visibility masks, the hand latent representation, and tactile information. Crucially, we incorporate physics-based objectives and differentiable decoder-guidance during finetuning and inference to reduce hand-object interpenetration and to align the reconstructed surface with contact observations. Because our method produces a metric, physically consistent structure estimate, it integrates naturally into existing two-stage reconstruction pipelines, where a downstream module refines geometry and predicts appearance. Experiments in simulation show that adding proprioception and touch substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale compared to vision-only baselines; we further validate transfer by deploying the model on a real humanoid robot with an end-effector different from those used during training.
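
The physics-based objectives named in the abstract (reducing hand-object interpenetration and aligning the reconstructed surface with contact observations) admit a simple SDF formulation. The sketch below is a minimal, hypothetical version, assuming the decoder yields a callable `sdf` (negative inside the object) and that hand surface points and tactile contact points are expressed in the same camera-aligned frame; it is not the paper's actual loss implementation.

```python
import torch

def penetration_loss(sdf, hand_points: torch.Tensor) -> torch.Tensor:
    """Penalize hand surface points that fall inside the object (SDF < 0)."""
    d = sdf(hand_points)          # signed distances at posed-hand surface points
    return torch.relu(-d).mean()  # only interior points (negative SDF) incur cost

def contact_alignment_loss(sdf, contact_points: torch.Tensor) -> torch.Tensor:
    """Tactile contact points should lie on the object surface, i.e. SDF ~ 0."""
    d = sdf(contact_points)
    return d.abs().mean()
```

In a decoder-guided setup like the one the abstract describes, terms of this kind could be differentiated through the SDF decoder during finetuning and inference to steer the generated latent toward penetration-free, contact-consistent geometry.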
Problem

Research questions and friction points this paper addresses.

hand occlusion
3D reconstruction
amodal completion
physical plausibility
metric-scale estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

proprioception
multi-contact touch
physically grounded reconstruction
conditional diffusion model
amodal 3D reconstruction