SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Modeling hand-object interaction, particularly with non-rigid (e.g., cloth, elastomers) or articulated objects, faces challenges including inaccurate physical modeling, dynamic distortion in generation, and poor real-time performance. This paper introduces SpriteHand, an autoregressive video generation framework for real-time hand-object interaction synthesis in real-world scenes. It combines a causal inference architecture with a hybrid post-training strategy to overcome the limitations of conventional physics engines in capturing complex deformations and dynamic coupling. Implemented as a 1.3B-parameter model, it achieves real-time generation at 640×368 resolution and around 18 FPS with roughly 150 ms latency on a single RTX 5090 GPU, and sustains high-fidelity output for more than 60 seconds. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared with state-of-the-art generative models and physics-based simulators.

📝 Abstract
Modeling and synthesizing complex hand-object interactions remains a significant challenge, even for state-of-the-art physics engines. Conventional simulation-based approaches rely on explicitly defined rigid object models and pre-scripted hand gestures, making them inadequate for capturing dynamic interactions with non-rigid or articulated entities such as deformable fabrics, elastic materials, hinge-based structures, furry surfaces, or even living creatures. In this paper, we present SpriteHand, an autoregressive video generation framework for real-time synthesis of versatile hand-object interaction videos across a wide range of object types and motion patterns. SpriteHand takes as input a static object image and a video stream in which the hands are imagined to interact with the virtual object embedded in a real-world scene, and generates corresponding hand-object interaction effects in real time. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Our 1.3B model supports real-time streaming generation at around 18 FPS and 640x368 resolution, with an approximate 150 ms latency on a single NVIDIA RTX 5090 GPU, and more than a minute of continuous output. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.
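The abstract's key architectural point is that generation is causal and streaming: each output frame is conditioned only on the static object image, the incoming hand-video frame, and previously generated frames, so frames can be emitted one at a time rather than per clip. A minimal conceptual sketch of that data flow (not the authors' code; all names here are hypothetical, and a real model would run a causal transformer where the placeholder string is built):

```python
from dataclasses import dataclass, field

@dataclass
class CausalGenerator:
    """Illustrative stand-in for an autoregressive video model."""
    object_image: str                            # static object image (conditioning input)
    history: list = field(default_factory=list)  # previously generated frames: the causal context

    def step(self, hand_frame: str) -> str:
        """Generate one interaction frame from causal context only.

        No future hand frames are consulted, which is what makes
        streaming (per-frame, not per-clip, latency) possible.
        """
        out = f"frame({self.object_image},{hand_frame},ctx={len(self.history)})"
        self.history.append(out)
        return out

def stream(gen: CausalGenerator, hand_stream):
    # Frames are yielded as soon as each hand frame arrives,
    # mirroring the ~18 FPS streaming rollout described above.
    for hand_frame in hand_stream:
        yield gen.step(hand_frame)

gen = CausalGenerator(object_image="cloth.png")
frames = list(stream(gen, ["h0", "h1", "h2"]))
print(frames[-1])  # → frame(cloth.png,h2,ctx=2)
```

The per-frame loop, rather than whole-clip diffusion, is what bounds latency to a single model step (the ~150 ms figure reported in the abstract).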
Problem

Research questions and friction points this paper is trying to address.

Real-time synthesis of hand-object interaction videos
Modeling dynamic interactions with non-rigid or articulated objects
Generating physically plausible and visually realistic interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive video generation for hand-object interaction
Causal inference architecture enhancing temporal coherence
Hybrid post-training approach improving visual realism
Zisu Li
The Hong Kong University of Science and Technology
Human-Computer Interaction
Hengye Lyu
HKUST (Guangzhou)
Jiaxin Shi
XMax.AI Ltd.
Yufeng Zeng
HKUST (Guangzhou)
Mingming Fan
The Hong Kong University of Science and Technology (Guangzhou)
HCI · Accessible Computing · VR/AR/MR · Human-AI Interaction · Human-Agent Interaction
Hanwang Zhang
Nanyang Technological University
Chen Liang
HKUST (Guangzhou) & XMax.AI Ltd.