VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning

📅 2025-10-16
📈 Citations: 2
Influential: 0
🤖 AI Summary
Problem: Policies for bimanual, contact-rich robotic assembly generalize poorly when they are learned only from suboptimal and insufficiently diverse human demonstrations. Method: VT-Refine is a framework that first initializes a diffusion policy on synchronized vision–tactile inputs, then refines it with large-scale reinforcement learning in a GPU-accelerated simulation of piezoresistive tactile sensors integrated with a digital twin of the task. Contribution/Results: VT-Refine provides a robust initialization and subsequent fine-tuning of vision–tactile policies while supporting sim-to-real transfer. Experiments show higher success rates across multiple assembly tasks in both simulation and the real world, which the authors attribute to increased data diversity and more effective policy fine-tuning.

📝 Abstract
Humans excel at bimanual assembly tasks by adapting to rich tactile feedback -- a capability that remains difficult to replicate in robots through behavioral cloning alone, due to the suboptimality and limited diversity of human demonstrations. In this work, we present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning to tackle precise, contact-rich bimanual assembly. We begin by training a diffusion policy on a small set of demonstrations using synchronized visual and tactile inputs. This policy is then transferred to a simulated digital twin equipped with simulated tactile sensors and further refined via large-scale reinforcement learning to enhance robustness and generalization. To enable accurate sim-to-real transfer, we leverage high-resolution piezoresistive tactile sensors that provide normal force signals and can be realistically modeled in parallel using GPU-accelerated simulation. Experimental results show that VT-Refine improves assembly performance in both simulation and the real world by increasing data diversity and enabling more effective policy fine-tuning. Our project page is available at https://binghao-huang.github.io/vt_refine/.
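
The abstract notes that the piezoresistive sensors report normal force signals and can be simulated in parallel. As a rough illustration only, the sketch below maps per-taxel penetration depth to a saturating normal force with a linear penalty model. The penalty model itself, the stiffness and saturation values, and the names `taxel_normal_forces` and `simulate_batch` are assumptions made for this example and are not taken from the paper. It is written in plain Python for portability; a GPU simulator would apply the same per-taxel rule as one vectorized operation across all environments.

```python
from typing import List

Grid = List[List[float]]  # one tactile pad: rows x cols of per-taxel values


def taxel_normal_forces(depths_m: Grid,
                        stiffness_n_per_m: float = 1000.0,
                        max_force_n: float = 10.0) -> Grid:
    """Hypothetical linear penalty model for one pad.

    Positive depth means the object penetrates the pad surface. Force grows
    linearly with depth and saturates at max_force_n; no contact gives zero.
    """
    return [
        [min(max(d, 0.0) * stiffness_n_per_m, max_force_n) for d in row]
        for row in depths_m
    ]


def simulate_batch(batch_depths_m: List[Grid], **kwargs) -> List[Grid]:
    """Apply the same per-taxel model to every environment in a batch."""
    return [taxel_normal_forces(pad, **kwargs) for pad in batch_depths_m]


# Example: a 2 x 3 pad with no contact, light contact, and a saturated taxel.
pad = [[-0.001, 0.0, 0.002],
       [0.005, 0.02, 0.001]]
forces = taxel_normal_forces(pad)
batch_forces = simulate_batch([pad, pad])
```

Because the rule is independent per taxel and per environment, it parallelizes trivially, which is the property the abstract relies on when it describes large-scale reinforcement learning with simulated tactile sensing.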

Problem

Research questions and friction points this paper is trying to address.

Learning bimanual assembly with visuo-tactile feedback
Improving robot policy robustness through simulation fine-tuning
Enabling precise contact-rich assembly via sim-to-real transfer

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining real demonstrations with tactile simulation
Transferring policy to digital twin for refinement
Using high-resolution piezoresistive sensors for sim-to-real