TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D human–object interaction (HOI) modeling predominantly adopts unidirectional paradigms (e.g., human→object or object→human) and lacks joint generative capability over humans, objects, and their interactions. This work introduces the first tri-directional joint diffusion framework, which models all three modalities simultaneously and supports seven distinct conditional generation tasks. Methodologically, the authors propose a cross-modal Transformer architecture over unified tokenized representations, together with a dual-path conditional embedding scheme that integrates textual descriptions and contact maps to enable fine-grained, controllable synthesis. Evaluated on GRAB and BEHAVE, the model significantly outperforms unidirectional baselines on both qualitative and quantitative metrics. It also demonstrates strong capabilities in scene completion, contact-aware data synthesis, and zero-shot generalization to unseen object geometries.

📝 Abstract
Modeling 3D human-object interaction (HOI) is a problem of great interest for computer vision and a key enabler for virtual and mixed-reality applications. Existing methods work in a one-way direction: some recover plausible human interactions conditioned on a 3D object; others recover the object pose conditioned on a human pose. Instead, we provide the first unified model - TriDi which works in any direction. Concretely, we generate Human, Object, and Interaction modalities simultaneously with a new three-way diffusion process, allowing us to model seven distributions with one network. We implement TriDi as a transformer attending to the various modalities' tokens, thereby discovering conditional relations between them. The user can control the interaction either as a text description of HOI or a contact map. We embed these two representations into a shared latent space, combining the practicality of text descriptions with the expressiveness of contact maps. Using a single network, TriDi unifies all the special cases of prior work and extends to new ones, modeling a family of seven distributions. Remarkably, despite using a single model, TriDi's generated samples surpass one-way specialized baselines on GRAB and BEHAVE in terms of both qualitative and quantitative metrics, while demonstrating better diversity. We show the applicability of TriDi to scene population, generating objects for human-contact datasets, and generalization to unseen object geometry. The project page is available at: https://virtualhumans.mpi-inf.mpg.de/tridi.
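The "seven distributions with one network" claim follows from a simple counting argument: each of the three modalities (Human, Object, Interaction) is either generated or given as conditioning, and at least one must be generated, yielding 2³ − 1 = 7 tasks. A minimal sketch of that enumeration is below; the function and variable names are illustrative, not taken from the paper's code.

```python
from itertools import combinations

MODALITIES = ("Human", "Object", "Interaction")

def tridi_tasks():
    """Enumerate the seven generation tasks a tri-directional model covers:
    every non-empty subset of modalities is generated, and the complement
    (possibly empty) is given as conditioning."""
    tasks = []
    for r in range(1, len(MODALITIES) + 1):
        for generated in combinations(MODALITIES, r):
            conditioned = tuple(m for m in MODALITIES if m not in generated)
            tasks.append((generated, conditioned))
    return tasks

tasks = tridi_tasks()
# 3 fully conditional tasks (generate one modality given the other two),
# 3 partially conditional tasks, and 1 unconditional joint task: 7 total.
```

This covers, as special cases, the one-way baselines the abstract mentions (e.g., generating Human conditioned on Object) plus the unconditional joint over all three.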
Problem

Research questions and friction points this paper is trying to address.

Existing HOI methods model interaction in only one direction (human→object or object→human).
No single model jointly generates human, object, and interaction modalities.
Text descriptions and contact maps have not been combined as complementary control signals.
Innovation

Methods, ideas, or system contributions that make the work stand out.

A single unified model (TriDi) covering seven conditional HOI distributions with one network.
A three-way diffusion process that generates human, object, and interaction modalities simultaneously.
A transformer attending to cross-modal tokens, with text descriptions and contact maps embedded in a shared latent space for control.
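The shared latent space means either control signal (a text description or a contact map) can be projected to the same conditioning vector and fed to the diffusion model interchangeably. A minimal sketch of that idea, assuming purely illustrative feature dimensions and random (in practice, learned) linear projections:

```python
import numpy as np

rng = np.random.default_rng(0)

D_TEXT, D_CONTACT, D_SHARED = 512, 256, 128  # illustrative sizes, not from the paper

# Hypothetical projection matrices; a real model would learn these jointly.
W_text = rng.standard_normal((D_TEXT, D_SHARED)) / np.sqrt(D_TEXT)
W_contact = rng.standard_normal((D_CONTACT, D_SHARED)) / np.sqrt(D_CONTACT)

def embed_text(text_feat: np.ndarray) -> np.ndarray:
    """Project a text-description feature into the shared latent space."""
    return text_feat @ W_text

def embed_contact(contact_feat: np.ndarray) -> np.ndarray:
    """Project a flattened contact map into the same latent space,
    so either signal can condition generation interchangeably."""
    return contact_feat @ W_contact

z_text = embed_text(rng.standard_normal(D_TEXT))
z_contact = embed_contact(rng.standard_normal(D_CONTACT))
```

Both embeddings land in the same D_SHARED-dimensional space, combining the practicality of text with the expressiveness of contact maps, as the abstract describes.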