DexSIM: Real-time Dexterous Simulation with Unified Causal Video Diffusion

📅 2026-05-23
🤖 AI Summary
Existing video diffusion models struggle to achieve real-time interactivity, long-term spatial consistency, and memory at the same time in dexterous hand manipulation simulation. This work proposes DexSIM, the first framework to introduce a unified causal video diffusion model to this task. DexSIM is trained in two stages: a bidirectional video diffusion model first embeds hand motion trajectories and video in a shared feature space, and roll-out-based autoregressive training follows. It further incorporates a Gaussian heatmap-based hand encoding scheme and a spatial cache attention mechanism to improve 3D awareness and long-range consistency. Experiments show that DexSIM outperforms the baseline in pixel- and semantic-level similarity, motion fidelity, and hand projection accuracy, supports hand motion transfer, and runs interactively in real time at 15.24 frames per second.
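The Gaussian heatmap hand encoding mentioned above can be illustrated with a minimal sketch. It assumes the hand is given as 2D joint positions already projected into pixel coordinates, and renders one heatmap channel per joint. The function name `gaussian_heatmap`, the `sigma` value, and the example joints are hypothetical and not taken from the paper, which may encode hands differently.

```python
import math

def gaussian_heatmap(keypoints, height, width, sigma=2.0):
    """Render one heatmap channel per 2D keypoint.

    keypoints: list of (x, y) pixel coordinates (assumed already projected).
    Returns a list of height x width grids with values in (0, 1],
    peaking at 1.0 where a pixel center coincides with the keypoint.
    """
    heatmaps = []
    for kx, ky in keypoints:
        channel = [
            [
                math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2.0 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        heatmaps.append(channel)
    return heatmaps

# Hypothetical example: two projected hand joints on an 8x8 grid.
joints = [(3.0, 2.0), (5.5, 4.0)]
maps = gaussian_heatmap(joints, height=8, width=8, sigma=1.5)
```

Compared with a one-hot joint mask, a soft Gaussian blob keeps sub-pixel position information (the second joint above sits between pixel centers) and gives a smoother conditioning signal.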
📝 Abstract
Recent progress in video diffusion models has enabled extensive simulation of the physical world, but simulation of hand-object interaction remains less explored. We propose DexSIM, a framework for simulating dexterous manipulation in real time. Previous works that use video diffusion and 3D reconstruction focus on navigation; dexterous manipulation has received limited attention despite its extensive applications in creating interactive experiences with the simulated world and in generating synthetic data for robotics. Existing methods lack real-time interactivity, long-term spatial consistency, and memory. We propose a two-stage training framework for DexSIM. First, we train a bidirectional video diffusion model by jointly embedding the hand action trajectory and the video in a unified feature space, using Gaussian heatmap hand encoding for a more accurate hand representation. Second, we conduct roll-out-based autoregressive training with an updated spatial cache as an attention sink for spatial memory, which improves long-term consistency and 3D-aware dexterous manipulation simulation. DexSIM outperforms the baseline on pixel and semantic similarity, motion fidelity, and hand projection accuracy. It also enables new applications such as hand motion transfer and runs interactively in real time at 15.24 FPS.
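The idea of a spatial cache used as an attention sink during autoregressive roll-out can be sketched as a bounded context: a small set of sink entries always stays visible, while per-frame entries slide through a fixed-size window. This is only an illustration under stated assumptions. The class `SinkCache`, its methods, and the string placeholders standing in for key/value tensors are all hypothetical, and the abstract does not specify how DexSIM builds or updates its cache.

```python
from collections import deque

class SinkCache:
    """Toy context cache: persistent sink entries plus a sliding window.

    Entries are opaque placeholders here; in a real model they would be
    key/value tensors (an assumption, not the paper's stated design).
    """

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.sink = []
        # deque with maxlen silently drops the oldest entry when full.
        self.recent = deque(maxlen=window)

    def update_sink(self, entries):
        # Replace the sink with refreshed spatial entries, capped at num_sink.
        self.sink = list(entries)[: self.num_sink]

    def append(self, entry):
        self.recent.append(entry)

    def context(self):
        # Every new frame attends to the sink first, then the recent window.
        return self.sink + list(self.recent)

# Hypothetical roll-out over six frames with two sink entries.
cache = SinkCache(num_sink=2, window=3)
cache.update_sink(["spatial0", "spatial1"])
for t in range(6):
    cache.append(f"frame{t}")
visible = cache.context()
```

Because the context length is capped at `num_sink + window`, the per-frame attention cost stays constant however long the roll-out runs, which is the kind of property a real-time system needs.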
Problem

Research questions and friction points this paper is trying to address.

dexterous manipulation
real-time simulation
video diffusion
spatial consistency
hand-object interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

dexterous manipulation
video diffusion
real-time simulation
spatial memory
hand-object interaction