FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation

📅 2026-02-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing vision–language–action models struggle with long-horizon, contact-rich dexterous manipulation because they do not explicitly model the structure of hand–object interaction. This work proposes FlowHOI, a two-stage flow-matching framework that decouples geometry-centric grasping from semantics-centric manipulation. Conditioned on an egocentric observation, a language instruction, and a 3D Gaussian Splatting scene reconstruction, FlowHOI generates semantically grounded, temporally coherent hand–object interaction sequences comprising hand poses, object poses, and contact states. The manipulation stage is conditioned on compact 3D scene tokens and trained with a motion–text alignment loss; a separate reconstruction pipeline recovers aligned hand–object trajectories and meshes from large-scale egocentric videos to supply an interaction prior. On the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy, a 1.7× higher physics simulation success rate than the strongest diffusion-based baseline, and a 40× inference speedup, and it is demonstrated on a real robot across four dexterous manipulation tasks.
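
To make the output representation concrete, the following is a minimal Python sketch of a container for a hand–object interaction sequence holding hand poses, object poses, and contact states. The class names, field names, the 7-value object pose layout, and the per-region contact flags are illustrative assumptions and not the paper's actual parameterization. The `contact_onset` helper is likewise hypothetical: it marks the first frame with any contact, which is one plausible place to separate a grasping phase from a manipulation phase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HOIFrame:
    # Hypothetical layout: flattened hand pose parameters (dimension not specified by the source).
    hand_pose: List[float]
    # Hypothetical layout: 3D translation followed by a unit quaternion (7 values).
    object_pose: List[float]
    # Hypothetical layout: one boolean contact flag per hand region.
    contact: List[bool]


@dataclass
class HOISequence:
    instruction: str
    frames: List[HOIFrame] = field(default_factory=list)

    def contact_onset(self) -> int:
        """Return the index of the first frame with any contact, or -1 if there is none."""
        for index, frame in enumerate(self.frames):
            if any(frame.contact):
                return index
        return -1


# Usage: a two-frame toy sequence where contact begins at the second frame.
sequence = HOISequence(
    instruction="pick up the mug",
    frames=[
        HOIFrame([0.0] * 3, [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0], [False, False]),
        HOIFrame([0.1] * 3, [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0], [True, False]),
    ],
)
onset = sequence.contact_onset()
```

Keeping contact as an explicit per-frame field, instead of inferring it from geometry afterwards, is what lets downstream code query interaction structure directly.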

📝 Abstract
Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion-text alignment loss to semantically ground the generated interactions in both the physical scene layout and the language instruction. To address the scarcity of high-fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand-object trajectories and meshes from large-scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7× higher physics simulation success rate than the strongest diffusion-based baseline, while delivering a 40× inference speedup. We further demonstrate real-robot execution on four dexterous manipulation tasks, illustrating the feasibility of retargeting generated HOI representations to real-robot execution pipelines.
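
For readers unfamiliar with flow matching, the sketch below shows the generic technique in plain Python. It assumes the common straight-line (linear interpolation) probability path and fixed-step Euler integration; the abstract does not state which path, solver, or network FlowHOI uses, so every function name here is hypothetical and the "oracle" velocity in the usage lines stands in for a trained, conditioned model.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Regression target for a velocity model on the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def training_example(x1: Vector, rng: random.Random) -> Tuple[Vector, float, Vector]:
    """Build one (x_t, t, target) triple for a flow-matching regression loss."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    return interpolate(x0, x1, t), t, target_velocity(x0, x1)


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Usage: with an oracle velocity for a single (noise, data) pair,
# a handful of Euler steps carries the noise sample onto the data point.
noise = [0.0, 2.0]
data = [1.0, -1.0]
sample = euler_sample(lambda x, t: target_velocity(noise, data), noise, steps=4)
```

In practice the velocity function would be a learned network that also receives conditioning inputs; the sampler itself is a deterministic ODE integration that can run with few steps.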

Problem

Research questions and friction points this paper is trying to address.

Hand-Object Interaction
Dexterous Manipulation
Embodiment-Agnostic Representation
Long-Horizon Tasks
Contact-Rich Interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-Based Generation
Hand-Object Interaction
Semantic Grounding
3D Gaussian Splatting
Dexterous Manipulation