🤖 AI Summary
This work addresses the problem of generating full 3D hand motion sequences and dynamic contact trajectories from a single RGB image, an action text description, and a single 3D contact point on the object surface. We propose a joint framework comprising an interaction-aware vector-quantized variational autoencoder (VQ-VAE) and an index-conditioned Transformer decoder: the VQ-VAE models high-dimensional interaction semantics in a latent space, while the Transformer autoregressively generates motion sequences conditioned on discrete codebook indices. To ensure data quality, we introduce a contact-aware data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. Evaluated on a large-scale benchmark spanning multiple object categories, action types, and scenes, our method significantly outperforms state-of-the-art Transformer- and diffusion-based baselines. To our knowledge, it is the first approach to achieve strong generalization and semantic controllability in single-view 3D interactive motion synthesis.
📝 Abstract
We tackle the novel problem of predicting 3D hand motion and contact maps (or Interaction Trajectories) given a single RGB view, action text, and a 3D contact point on the object as input. Our approach consists of (1) an Interaction Codebook: a VQ-VAE model that learns a latent codebook of hand poses and contact points, effectively tokenizing interaction trajectories, and (2) an Interaction Predictor: a transformer-decoder module that predicts the interaction trajectory from test-time inputs by using an indexer module to retrieve a latent affordance from the learned codebook. To train our model, we develop a data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. We evaluate our model on a benchmark that is 2.5-10X larger than existing works in terms of the diversity of objects and interactions observed, and test for generalization of the model across object categories, action categories, tasks, and scenes. Experimental results show the effectiveness of our approach over transformer and diffusion baselines across all settings.
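To make the tokenization idea concrete, here is a minimal NumPy sketch of the vector-quantization step that an interaction codebook performs: each latent frame of an encoded trajectory is mapped to the index of its nearest codebook entry, and indices can be looked up again to retrieve latents. All names, sizes, and the random data are illustrative assumptions, not the paper's implementation (which learns the encoder, codebook, and predictor end to end).

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
CODEBOOK_SIZE = 16   # number of discrete interaction tokens
LATENT_DIM = 4       # dimensionality of each latent frame

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

def quantize(latents):
    """Map each latent frame (T, D) to the index of its nearest
    codebook entry -- the step that tokenizes a trajectory."""
    # (T, K) matrix of squared Euclidean distances to every code
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def detokenize(indices):
    """Retrieve latent vectors from token indices -- the kind of
    lookup an indexer module performs over the learned codebook."""
    return codebook[indices]

# A toy 8-frame latent trajectory standing in for encoded
# hand poses and contact points.
traj = rng.normal(size=(8, LATENT_DIM))
tokens = quantize(traj)      # discrete interaction tokens, shape (8,)
recon = detokenize(tokens)   # nearest-code latents, shape (8, 4)
```

In the full model, a transformer decoder would predict `tokens` autoregressively from the image, text, and contact-point conditioning, and the VQ-VAE decoder would map the retrieved latents back to hand poses and contacts.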