Learning to Generate Human-Human-Object Interactions from Textual Descriptions

📅 2025-11-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper introduces the task of text-driven generation of Human-Human-Object Interactions (HHOIs): synthesizing spatially coordinated behavior of multiple people around a shared object from natural language descriptions. To support this task, the authors construct an HHOI dataset and propose a unified score-based diffusion framework. The method decomposes generation into two complementary sub-processes (text-to-human-object interaction and text-to-human-human interaction) and integrates them in a single sampling process. Experiments show that the approach outperforms baselines that model only single-human object interactions, and that it extends to interactions involving more than two people. The authors also demonstrate multi-human motion generation involving objects as an application.

📝 Abstract

The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem: modeling the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOI dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we obtain individual human-object interactions (HOIs) and human-human interactions (HHIs) from the HHOIs, and with these data, we train a text-to-HOI model and a text-to-HHI model using a score-based diffusion model. Finally, we present a unified generative framework that integrates the two individual models and is capable of synthesizing complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.

Problem

Research questions and friction points this paper is trying to address.

Modeling correlations between two people interacting with shared objects

Generating realistic human-human-object interactions from text descriptions

Extending interaction modeling to multi-human scenarios with objects

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizing HHOI data using image generative models

Training text-to-interaction models with score-based diffusion

Unified generative framework integrating multiple interaction models
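
The idea of drawing samples from two score models in one sampling loop can be illustrated with a toy sketch. This is not the paper's algorithm: the two "models" here are closed-form Gaussian scores standing in for trained text-to-HOI and text-to-HHI networks, the sampler is plain unadjusted Langevin dynamics, and every name (`gaussian_score`, `sample_composed`, the step size and counts) is a hypothetical choice made for illustration. Summing two scores targets the product of the two densities, so the samples settle where both constraints agree.

```python
import numpy as np


def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of an isotropic Gaussian N(mean, std^2 I)."""
    return -(x - mean) / std**2


def sample_composed(score_a, score_b, dim, n_samples=2000, n_steps=500,
                    step=0.01, seed=0):
    """Unadjusted Langevin dynamics driven by the sum of two score functions.

    score_a and score_b are stand-ins (assumptions) for two separately
    trained score models; the paper's actual integration scheme may differ.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, dim))  # start from pure noise
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        combined = score_a(x) + score_b(x)  # score of the product density
        x = x + step * combined + np.sqrt(2.0 * step) * noise
    return x


if __name__ == "__main__":
    # Two toy "constraints" that pull samples toward different points.
    score_a = lambda x: gaussian_score(x, np.array([0.0, 0.0]), 1.0)
    score_b = lambda x: gaussian_score(x, np.array([2.0, 0.0]), 1.0)
    samples = sample_composed(score_a, score_b, dim=2)
    print("sample mean:", samples.mean(axis=0))
    print("sample variance:", samples.var(axis=0))
```

For two unit-variance Gaussians, the product density is a Gaussian centered midway between the two means with half the variance, which gives a simple check that the combined sampler behaves as intended.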
