MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions

📅 2025-10-09
🤖 AI Summary
Existing 3D human–object interaction (HOI) benchmarks severely lack coverage of the causal, goal-directed, and collaborative multi-human–multi-object interactions prevalent in real-world scenarios. To address this gap, we introduce MMHOI, the first large-scale multi-human–multi-object 3D interaction dataset, featuring fully annotated 3D poses, shapes, and actions across 12 everyday scenarios. To model complex interaction structures, we propose a structured dual-patch representation that jointly encodes object geometry, interaction relations, and action semantics. We further design MMHOI-Net, an end-to-end Transformer-based architecture that unifies the prediction of 3D human–object geometric configurations, interaction relation graphs, and action categories. Extensive experiments on MMHOI and CORE4D demonstrate significant improvements over prior methods, achieving state-of-the-art performance in both interaction modeling fidelity and geometric reconstruction quality—marking a dual advance in understanding and reconstructing complex 3D interactions.

📝 Abstract
Real-world scenes often feature multiple humans interacting with multiple objects in ways that are causal, goal-oriented, or cooperative. Yet existing 3D human-object interaction (HOI) benchmarks consider only a fraction of these complex interactions. To close this gap, we present MMHOI -- a large-scale, Multi-human Multi-object Interaction dataset consisting of images from 12 everyday scenarios. MMHOI offers complete 3D shape and pose annotations for every person and object, along with labels for 78 action categories and 14 interaction-specific body parts, providing a comprehensive testbed for next-generation HOI research. Building on MMHOI, we present MMHOI-Net, an end-to-end transformer-based neural network for jointly estimating human-object 3D geometries, their interactions, and associated actions. A key innovation in our framework is a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to enhance the interaction prediction. Experiments on MMHOI and the recently proposed CORE4D datasets demonstrate that our approach achieves state-of-the-art performance in multi-HOI modeling, excelling in both accuracy and reconstruction quality.
Problem

Research questions and friction points this paper is trying to address.

Modeling complex 3D multi-human multi-object interactions in real-world scenes
Addressing limitations in existing 3D human-object interaction benchmarks
Developing a comprehensive framework for joint geometry estimation and interaction prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-patch representation models objects and their interactions
Transformer network jointly estimates 3D geometries and actions
Action recognition enhances interaction prediction accuracy
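To make the notion of an interaction relation graph concrete, the sketch below shows one minimal way such a structure could be represented: humans and objects as nodes, with edges carrying a body-part and action label (the 78 action categories and 14 interaction-specific body parts are from the abstract; all class and field names here are hypothetical illustrations, not MMHOI-Net's actual data structures).

```python
# Illustrative sketch only: a toy multi-human multi-object interaction
# relation graph. Names are hypothetical, not from the MMHOI codebase.
from dataclasses import dataclass, field

@dataclass
class Human:
    id: int
    pose: list  # placeholder for 3D pose/shape parameters

@dataclass
class Object3D:
    id: int
    category: str

@dataclass(frozen=True)
class InteractionEdge:
    human_id: int
    object_id: int
    body_part: str  # e.g. one of 14 interaction-specific body parts
    action: str     # e.g. one of 78 action categories

@dataclass
class InteractionGraph:
    humans: dict = field(default_factory=dict)
    objects: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_edge(self, edge: InteractionEdge) -> None:
        # Edges may only connect humans and objects already in the graph.
        assert edge.human_id in self.humans and edge.object_id in self.objects
        self.edges.append(edge)

    def actions_of(self, human_id: int) -> list:
        return [e.action for e in self.edges if e.human_id == human_id]

# Usage: two people cooperatively carrying one table.
g = InteractionGraph()
g.humans[0] = Human(0, pose=[])
g.humans[1] = Human(1, pose=[])
g.objects[10] = Object3D(10, "table")
g.add_edge(InteractionEdge(0, 10, "right_hand", "carry"))
g.add_edge(InteractionEdge(1, 10, "left_hand", "carry"))
print(g.actions_of(0))  # ['carry']
```

A graph like this captures the collaborative structure (two humans, one shared object) that single-human single-object HOI benchmarks cannot express.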