🤖 AI Summary
Existing human-object interaction (HOI) methods are split between generation and editing: generation methods struggle to integrate mixed conditions, while editing methods struggle to decouple pose from physical contact and to scale to multiple interactions. This work proposes OneHOI, the first framework to unify HOI generation and editing in a single conditional denoising process built on a diffusion transformer. Its core, the Relational Diffusion Transformer (R-DiT), combines role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention mechanism, and HOI RoPE to disentangle spatial and semantic relationships in multi-interaction scenes. A modality dropout strategy enables joint training across diverse conditions. Trained on the newly curated HOI-Edit-44K together with HOI and object-centric datasets, OneHOI achieves state-of-the-art results on both HOI generation and editing, and supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control.
📝 Abstract
Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families. HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions such as HOI and object-only entities. HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and to scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K dataset, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.
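
The abstract does not describe how modality dropout is implemented, so the following is only a minimal sketch of the general technique, not the authors' code. It assumes each training sample carries a dictionary of optional conditions and that each condition is withheld independently with a fixed probability. The condition names (`layout`, `edit_mask`, `hoi_triplets`), the function `apply_modality_dropout`, and the value `drop_prob=0.3` are all hypothetical.

```python
import random


def apply_modality_dropout(conditions, drop_prob, rng):
    """Return a copy of `conditions` in which each entry is independently
    replaced by None with probability `drop_prob`.

    None stands for "condition withheld"; how a real model encodes a
    missing condition is an assumption left outside this sketch.
    """
    return {
        name: (None if rng.random() < drop_prob else value)
        for name, value in conditions.items()
    }


# Hypothetical training sample; keys and values are illustrative only.
sample = {
    "layout": [(0.1, 0.2, 0.5, 0.9), (0.4, 0.5, 0.8, 0.9)],  # person and object boxes
    "edit_mask": "mask-placeholder",
    "hoi_triplets": [("person", "ride", "bicycle")],
}

rng = random.Random(0)  # seeded so the run is repeatable
dropped = apply_modality_dropout(sample, drop_prob=0.3, rng=rng)
print({name: value is not None for name, value in dropped.items()})
```

Under these assumptions, a model trained on such samples sees every subset of conditions over time, which is the usual motivation for letting a single network serve layout-guided, layout-free, and mixed-condition requests at inference.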