🤖 AI Summary
This work addresses cross-modal generation and understanding of high-fidelity 3D hand-object interaction (HOI) sequences. To handle heterogeneous conditioning signals, including text, object identity, and partial action sequences, we propose a physically grounded, hand-object decomposed VQ-VAE tokenizer that yields motion-aware discrete representations of HOI sequences. We further design a motion-aware language model that jointly processes text and HOI tokens, and use a large language model to predict the bidirectional mapping between HOI sequences and natural language. This enables text-to-HOI generation, partial sequence completion, and HOI captioning. The method achieves state-of-the-art performance across multiple tasks and benchmarks: +2.01% R-Precision on text generation and −2.56 Fréchet Inception Distance (FID) on HOI generation. It is presented as the first approach to bidirectional, cross-modal HOI sequence modeling with high fidelity, physical plausibility, and editability.
📝 Abstract
We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interaction (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences). At its core, HOIGPT utilizes a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences. To facilitate HOI understanding with a large language model, this paper introduces two key innovations: (1) a novel physically grounded HOI tokenizer, the hand-object decomposed VQ-VAE, for discretizing HOI sequences, and (2) a motion-aware language model trained to process and generate both text and HOI tokens. Extensive experiments demonstrate that HOIGPT sets new state-of-the-art performance on both text generation (+2.01% R-Precision) and HOI generation (−2.56 FID) across multiple tasks and benchmarks.
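As a rough illustration of what discretizing an HOI sequence with separate hand and object codebooks could look like, the sketch below performs the nearest-codebook lookup that sits at the heart of any VQ-VAE. It is not HOIGPT's implementation: the codebook values, the 2-D feature vectors, the token naming (`<hand_i>`, `<obj_j>`), and the per-frame interleaving are all assumptions made for this example, and the learned encoder, decoder, and any physical grounding are omitted.

```python
# Illustrative sketch only: nearest-codebook quantization with separate
# hand and object codebooks. All names, sizes, and values are hypothetical
# and are not taken from the HOIGPT paper.

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize(features, codebook, prefix):
    """Map a sequence of feature vectors to discrete token strings."""
    return [f"<{prefix}_{nearest_code(f, codebook)}>" for f in features]


# Toy 2-D codebooks (assumed; a real tokenizer would learn these).
HAND_CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
OBJECT_CODEBOOK = [(0.0, 0.0), (2.0, 2.0)]

# Toy per-frame encoder outputs for a two-frame sequence (assumed).
hand_features = [(0.9, 0.1), (0.1, 0.8)]
object_features = [(1.8, 2.1), (0.2, -0.1)]

hand_tokens = tokenize(hand_features, HAND_CODEBOOK, "hand")
object_tokens = tokenize(object_features, OBJECT_CODEBOOK, "obj")

# Interleave hand and object tokens per frame (an assumed layout) so that a
# language model could consume them alongside ordinary text tokens.
hoi_tokens = [tok for pair in zip(hand_tokens, object_tokens) for tok in pair]
print(hoi_tokens)
```

Keeping the two codebooks separate means hand motion and object motion each get their own vocabulary, which is one plausible reading of "hand-object decomposed"; the resulting token strings can then be added to a language model's vocabulary next to its text tokens.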