🤖 AI Summary
Modeling hand-object interaction, particularly with non-rigid (e.g., cloth, elastomers) or articulated objects, is hampered by inaccurate physical modeling, dynamic distortion in generated video, and poor real-time performance. This paper introduces SpriteHand, presented as the first autoregressive video generation framework for real-time hand-object interaction synthesis in real-world scenes; it combines a causal inference architecture with a hybrid post-training strategy to overcome the limitations of conventional physics engines in capturing complex deformations and dynamic coupling. Implemented as a 1.3B-parameter model, it generates 640×368 video at about 18 FPS with roughly 150 ms latency on a single NVIDIA RTX 5090 GPU, and sustains high-fidelity output for more than 60 seconds. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared with state-of-the-art generative models and physics-based simulators.
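To make the causal, streaming design concrete, below is a minimal sketch of frame-by-frame autoregressive generation. Everything in it is hypothetical: the paper publishes no API, and the toy recurrence stands in for the actual 1.3B generator. It only illustrates why causality enables bounded per-frame latency, since each output frame is conditioned solely on the incoming hand-video frame and a cache of past context, never on future frames.

```python
# Hypothetical sketch of causal, streaming autoregressive video generation.
# CausalVideoModel, decode_frame, and the latent shapes are all assumptions,
# not SpriteHand's real interface; a toy GRU recurrence stands in for the
# actual generator to show the causal streaming pattern only.
import torch

class CausalVideoModel(torch.nn.Module):
    """Stand-in for the 1.3B causal generator (illustrative only)."""
    def __init__(self, latent_dim=64):
        super().__init__()
        # Toy recurrence: the cached state summarizes all past frames.
        self.step = torch.nn.GRUCell(latent_dim, latent_dim)

    @torch.no_grad()
    def decode_frame(self, cond_latent, state):
        # Causality: the new frame depends only on the current hand-frame
        # latent and state built from already-generated frames.
        state = self.step(cond_latent, state)
        return state, state  # (new frame latent, updated cache)

def stream_generate(model, hand_stream, object_latent):
    """Yield one synthesized frame latent per incoming hand-video frame."""
    state = object_latent.clone()  # seed cache from the static object image
    for hand_latent in hand_stream:  # arrives in real time, e.g. ~18 FPS
        frame, state = model.decode_frame(hand_latent, state)
        yield frame  # a pixel decoder would run downstream

model = CausalVideoModel()
hand_stream = (torch.randn(1, 64) for _ in range(8))  # dummy input frames
object_latent = torch.randn(1, 64)
frames = list(stream_generate(model, hand_stream, object_latent))
print(len(frames), frames[0].shape)  # 8 frames, generated strictly causally
```

By contrast, a bidirectional video diffusion model must denoise an entire clip jointly, which rules out this kind of frame-by-frame streaming.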
📝 Abstract
Modeling and synthesizing complex hand-object interactions remains a significant challenge, even for state-of-the-art physics engines. Conventional simulation-based approaches rely on explicitly defined rigid object models and pre-scripted hand gestures, making them inadequate for capturing dynamic interactions with non-rigid or articulated entities such as deformable fabrics, elastic materials, hinge-based structures, furry surfaces, or even living creatures. In this paper, we present SpriteHand, an autoregressive video generation framework for real-time synthesis of versatile hand-object interaction videos across a wide range of object types and motion patterns. SpriteHand takes as input a static object image and a video stream in which the hands gesture as if interacting with that virtual object embedded in a real-world scene, and it generates the corresponding hand-object interaction effects in real time. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Our 1.3B model supports real-time streaming generation at around 18 FPS and 640×368 resolution, with approximately 150 ms latency on a single NVIDIA RTX 5090 GPU, and more than a minute of continuous output. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.
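As a quick sanity check on the reported figures (a back-of-envelope calculation, not from the paper): 18 FPS implies about 55.6 ms between output frames, so a 150 ms end-to-end latency corresponds to roughly three frames in flight, which is consistent with chunk-wise causal generation.

```python
# Back-of-envelope check on the reported real-time figures. The 18 FPS
# throughput and 150 ms latency come from the abstract; the "pipeline
# depth" interpretation is our assumption, not a claim from the paper.
fps = 18
frame_period_ms = 1000 / fps                    # ~55.6 ms between frames
latency_ms = 150
pipeline_depth = latency_ms / frame_period_ms   # ~2.7 frames "in flight"
print(f"{frame_period_ms:.1f} ms/frame; ~{pipeline_depth:.1f} frames of latency")
```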