OneHOI: Unifying Human-Object Interaction Generation and Editing

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing human-object interaction (HOI) methods are split between generation and editing. Generation methods struggle to combine mixed conditions, such as HOI triplets alongside object-only entities; editing methods struggle to decouple pose from physical contact and to scale to multiple interactions. This work proposes OneHOI, a diffusion transformer framework that unifies HOI generation and editing in a single conditional denoising process. Its core, the Relational Diffusion Transformer (R-DiT), combines role- and instance-aware HOI tokens, layout-based spatial Action Grounding, Structured HOI Attention that enforces interaction topology, and HOI RoPE that disentangles multi-HOI scenes. Modality dropout enables joint training on HOI-Edit-44K together with additional HOI and object-centric datasets. OneHOI reports state-of-the-art results on both HOI generation and editing and supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control.
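The summary does not say how modality dropout is applied, so the following is only a minimal sketch of the general technique, not the authors' implementation. It assumes each conditioning signal is dropped independently with a fixed probability during training; the condition names (`layout`, `hoi`, `mask`, `text`) and the function `modality_dropout` are hypothetical.

```python
import random
from typing import Any, Dict, Optional


def modality_dropout(
    conditions: Dict[str, Any],
    drop_prob: float,
    rng: random.Random,
) -> Dict[str, Optional[Any]]:
    """Independently replace each condition with None with probability drop_prob.

    Assumption: dropped conditions are represented as None; a real model
    would more likely substitute a learned null embedding.
    """
    return {
        name: (None if rng.random() < drop_prob else value)
        for name, value in conditions.items()
    }


# Hypothetical training-time conditions for one sample.
sample = {
    "layout": [(0.1, 0.2, 0.5, 0.9)],        # assumed normalised bounding boxes
    "hoi": [("person", "ride", "bicycle")],  # <person, action, object> triplet
    "mask": None,                             # no edit mask for this sample
    "text": "a person riding a bicycle",
}

rng = random.Random(0)
kept_all = modality_dropout(sample, drop_prob=0.0, rng=rng)
dropped_all = modality_dropout(sample, drop_prob=1.0, rng=rng)
```

Training on randomly reduced condition sets is what would let a single model serve layout-guided, layout-free, and mixed-condition requests at inference time.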

📝 Abstract

Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.
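To make the triplet representation concrete, here is a minimal sketch of how <person, action, object> triplets could be expanded into role- and instance-aware tokens, together with one plausible attention mask. The abstract does not specify the Structured HOI Attention pattern; the rule used here (tokens attend only within their own interaction) is an assumption chosen for illustration, and `HOIToken`, `triplets_to_tokens`, and `structured_mask` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HOIToken:
    instance: int  # index of the interaction this token belongs to
    role: str      # "person", "action", or "object"
    text: str      # label for the role, e.g. "ride"


def triplets_to_tokens(triplets: List[Tuple[str, str, str]]) -> List[HOIToken]:
    """Expand each <person, action, object> triplet into three tagged tokens."""
    tokens: List[HOIToken] = []
    for idx, (person, action, obj) in enumerate(triplets):
        tokens.append(HOIToken(idx, "person", person))
        tokens.append(HOIToken(idx, "action", action))
        tokens.append(HOIToken(idx, "object", obj))
    return tokens


def structured_mask(tokens: List[HOIToken]) -> List[List[bool]]:
    """Boolean mask: entry [i][j] is True if token i may attend to token j.

    Assumption: attention is allowed only between tokens of the same
    interaction instance, so separate HOIs do not mix.
    """
    return [[a.instance == b.instance for b in tokens] for a in tokens]


tokens = triplets_to_tokens([
    ("person", "ride", "bicycle"),
    ("person", "hold", "cup"),
])
mask = structured_mask(tokens)
```

With two triplets the mask is block-diagonal: the first three tokens see each other, the last three see each other, and nothing crosses between the two interactions.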

Problem

Research questions and friction points this paper is trying to address.

Human-Object Interaction

HOI generation

HOI editing

structured interaction

multi-HOI scenes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-Object Interaction

Diffusion Transformer

Unified Generation and Editing