FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation

📅 2026-02-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision–language–action models struggle with long-horizon, contact-intensive dexterous manipulation because they do not explicitly model hand–object interaction structure. This work proposes FlowHOI, a two-stage flow-matching framework that decouples geometry-guided grasping from semantics-driven manipulation. Conditioned on first-person observations, language instructions, and a 3D Gaussian Splatting scene reconstruction, FlowHOI generates semantically aligned and temporally coherent hand–object interaction sequences comprising hand poses, object poses, and contact states. The method introduces scene-token conditioning, a motion–text alignment loss, and a reconstruction pipeline that recovers high-fidelity hand–object trajectories from large-scale egocentric videos. On the GRAB and HOT3D benchmarks, FlowHOI achieves state-of-the-art action recognition accuracy, a 1.7× higher physics-simulation success rate than the strongest diffusion-based baseline, and a 40× inference speedup, and it executes four dexterous manipulation tasks on a real robot.
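The summary names a motion–text alignment loss but does not spell out its form here. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style objective between pooled motion features and instruction embeddings; the module name `MotionTextAlign`, the projection heads, and the temperature are hypothetical stand-ins, not the paper's implementation.

```python
# Illustrative sketch: a generic symmetric contrastive alignment between
# pooled HOI motion features and language-instruction features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionTextAlign(nn.Module):
    def __init__(self, motion_dim: int, text_dim: int, embed_dim: int = 256,
                 temperature: float = 0.07):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, embed_dim)  # project pooled motion features
        self.text_proj = nn.Linear(text_dim, embed_dim)      # project instruction embedding
        self.temperature = temperature

    def forward(self, motion_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # motion_feats: (B, motion_dim) pooled over a generated HOI sequence
        # text_feats:   (B, text_dim) pooled language-instruction embedding
        m = F.normalize(self.motion_proj(motion_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = m @ t.T / self.temperature                  # (B, B) similarity matrix
        targets = torch.arange(m.size(0), device=m.device)   # matched pairs on the diagonal
        # symmetric cross-entropy: motion -> text and text -> motion
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```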

📝 Abstract
Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion-text alignment loss to semantically ground the generated interactions in both the physical scene layout and the language instruction. To address the scarcity of high-fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand-object trajectories and meshes from large-scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7× higher physics simulation success rate than the strongest diffusion-based baseline, while delivering a 40× inference speedup. We further demonstrate real-robot execution on four dexterous manipulation tasks, illustrating the feasibility of retargeting generated HOI representations to real-robot execution pipelines.
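The abstract does not detail the flow-matching formulation, so the following is a minimal sketch under common assumptions: a velocity-field network v_θ(x_t, t, c) is regressed toward the straight-line (rectified-flow) target x₁ − x₀ along a linear interpolation path, and generation integrates the learned field with a few Euler steps, which is where the large speedup over iterative diffusion sampling typically comes from. The `VelocityField` network, the conditioning vector `cond` (standing in for scene tokens and language features), and the step count are all hypothetical, not the paper's architecture.

```python
# Minimal conditional flow-matching sketch (not the paper's implementation).
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Toy velocity network v_theta(x_t, t, c); a stand-in for the paper's generator."""
    def __init__(self, x_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, x_dim) interpolated HOI state, t: (B, 1) flow time, cond: (B, cond_dim)
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(model, x1, cond):
    """Conditional flow-matching objective with a linear (rectified-flow) path."""
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.size(0), 1, device=x1.device)  # uniform flow time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1                    # point on the straight path
    target_v = x1 - x0                               # constant target velocity along the path
    return ((model(x_t, t, cond) - target_v) ** 2).mean()

@torch.no_grad()
def sample(model, cond, x_dim, steps=10):
    """Few-step Euler integration of the learned velocity field."""
    x = torch.randn(cond.size(0), x_dim, device=cond.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.size(0), 1), i * dt, device=cond.device)
        x = x + dt * model(x, t, cond)
    return x
```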
Problem

Research questions and friction points this paper is trying to address.

Hand-Object Interaction
Dexterous Manipulation
Embodiment-Agnostic Representation
Long-Horizon Tasks
Contact-Rich Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-Based Generation
Hand-Object Interaction
Semantic Grounding
3D Gaussian Splatting
Dexterous Manipulation
Huajian Zeng
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
Lingyun Chen
Munich Institute of Robotics and Machine Intelligence, TUM
Jiaqi Yang
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
Yuantai Zhang
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
Fan Shi
Assistant Professor at National University of Singapore
Robotics
Peidong Liu
Westlake University
3D Computer Vision · Robotics
Xingxing Zuo
Assistant Professor @ MBZUAI
Robotics · State Estimation · Embodied AI