StructBiHOI: Structured Articulation Modeling for Long-Horizon Bimanual Hand-Object Interaction Generation

📅 2026-03-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing approaches struggle to simultaneously ensure temporal consistency, physical plausibility, and semantic alignment in long-horizon bimanual robotic manipulation, particularly when generating coherent hand–object interaction sequences under multimodal conditions. This work proposes a hierarchical structured modeling framework that, for the first time, decouples long-term joint motion evolution from single-frame hand pose refinement. By integrating structured variational autoencoders (jointVAE/maniVAE) with a Mamba-based state-space diffusion denoiser, the method efficiently captures long-range dependencies with linear computational complexity. Evaluated on both bimanual manipulation and single-hand grasping benchmarks, the approach significantly outperforms current state-of-the-art methods, achieving leading performance in long-horizon stability, motion realism, and computational efficiency.

📝 Abstract
Recent progress in 3D hand-object interaction (HOI) generation has primarily focused on single-hand grasp synthesis, while bimanual manipulation remains significantly more challenging. Long-horizon planning instability, fine-grained joint articulation, and complex cross-hand coordination make coherent bimanual generation difficult, especially under multimodal conditions. Existing approaches often struggle to simultaneously ensure temporal consistency, physical plausibility, and semantic alignment over extended sequences. We propose StructBiHOI, a Structured articulation modeling framework for long-horizon Bimanual HOI generation. Our key insight is to structurally disentangle temporal joint planning from frame-level manipulation refinement. Specifically, a jointVAE models long-term joint evolution conditioned on object geometry and task semantics, while a maniVAE refines fine-grained hand poses at the single-frame level. To enable stable and efficient long-sequence generation, we incorporate a state-space-inspired diffusion denoiser based on Mamba, which models long-range dependencies with linear complexity. This hierarchical design facilitates coherent dual-hand coordination and articulated object interaction. Extensive experiments on bimanual manipulation and single-hand grasping benchmarks demonstrate that our method achieves superior long-horizon stability, motion realism, and computational efficiency compared to strong baselines.
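The "linear complexity" claim for a state-space denoiser can be illustrated with a toy example. The sketch below is not the paper's denoiser and not Mamba itself: it is a minimal scalar linear state-space recurrence with fixed, hypothetical parameters `a`, `b`, `c` (Mamba additionally makes its parameters input-dependent). It only shows why a recurrent scan touches each time step once, while the equivalent explicit sum over all past inputs grows quadratically with sequence length.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Single left-to-right pass, O(T): h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = 0.0  # hidden state carries a summary of all past inputs
    y = []
    for x_t in x:
        h = a * h + b * x_t
        y.append(c * h)
    return y


def ssm_explicit(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Same input-output map written as a sum over the full history, O(T^2):
    y_t = sum_{k=0..t} c * a**k * b * x_{t-k}."""
    y = []
    for t in range(len(x)):
        y.append(sum(c * (a ** k) * b * x[t - k] for k in range(t + 1)))
    return y


if __name__ == "__main__":
    # Toy 1-D sequence; a, b, c are illustrative values, not taken from the paper.
    seq = [1.0, 0.0, 0.0, 2.0]
    print(ssm_scan(seq, a=0.5, b=1.0, c=1.0))      # [1.0, 0.5, 0.25, 2.125]
    print(ssm_explicit(seq, a=0.5, b=1.0, c=1.0))  # same values
```

Both functions compute the same outputs; the scan simply reuses the running state instead of revisiting the whole history at every step, which is the property that makes state-space layers attractive for long sequences.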
Problem

Research questions and friction points this paper is trying to address.

bimanual manipulation
hand-object interaction
long-horizon generation
temporal consistency
cross-hand coordination
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured articulation modeling
bimanual hand-object interaction
long-horizon generation
state-space diffusion
Mamba-based denoiser