End-to-End HOI Reconstruction Transformer with Graph-based Encoding

📅 2025-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D HOI reconstruction methods suffer from an inherent trade-off between global structural modeling and fine-grained contact detail recovery. To address this, we propose an implicit interaction modeling paradigm that eliminates explicit interaction representation. Our method introduces a graph-residual block embedded within a Transformer architecture to jointly encode topological relationships between human and object vertices, enabling unified optimization of both global geometry and local contact regions. It integrates self-attention mechanisms, graph neural network–based encoding, differentiable mesh representations, and an end-to-end joint optimization framework. Evaluated on BEHAVE and InterCap, our approach achieves state-of-the-art performance: on InterCap, it reduces human and object mesh reconstruction errors by 8.9% and 8.6%, respectively. To the best of our knowledge, this is the first method to achieve high-fidelity, end-to-end joint mesh reconstruction for human–object interaction.

📝 Abstract
With the diversification of human-object interaction (HOI) applications and the success of human mesh capture, HOI reconstruction has gained widespread attention. Existing mainstream HOI reconstruction methods often rely on explicitly modeling the interactions between humans and objects. However, this approach leads to a natural conflict between 3D mesh reconstruction, which emphasizes global structure, and fine-grained contact reconstruction, which focuses on local details. To address the limitations of explicit modeling, we propose the End-to-End HOI Reconstruction Transformer with Graph-based Encoding (HOI-TG), which implicitly learns the interaction between humans and objects through self-attention. Within the transformer architecture, we devise graph residual blocks that aggregate the topology among vertices of different spatial structures. This dual focus effectively balances global and local representations. Without bells and whistles, HOI-TG achieves state-of-the-art performance on the BEHAVE and InterCap datasets. On the challenging InterCap dataset in particular, our method improves the reconstruction results for human and object meshes by 8.9% and 8.6%, respectively.
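The abstract's core idea of implicit interaction modeling can be illustrated with a minimal sketch: human and object vertex tokens are concatenated into one sequence so that attention weights span both meshes, letting their interaction emerge from learned attention rather than from an explicit contact term. All names and the plain single-head formulation below are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_self_attention(human_tokens, object_tokens, Wq, Wk, Wv):
    """Hypothetical sketch: run single-head self-attention over the
    concatenation of human and object vertex tokens, so every token can
    attend across both meshes (implicit interaction modeling)."""
    X = np.concatenate([human_tokens, object_tokens], axis=0)  # (Nh+No, d)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product
    return softmax(scores) @ V                # rows mix human + object features
```

Because the attention matrix covers the full concatenated sequence, human vertices can directly aggregate object features (and vice versa) without any hand-designed contact representation.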
Problem

Research questions and friction points this paper is trying to address.

Addresses the conflict between global 3D mesh reconstruction and fine-grained contact reconstruction.
Proposes a transformer with graph-based encoding for implicit human-object interaction learning.
Improves human and object mesh reconstruction on challenging datasets.
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-End Transformer for HOI reconstruction
Graph-based encoding with self-attention mechanisms
Graph residual blocks for topology aggregation
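The graph residual block named above can be sketched as a GCN-style propagation over the mesh adjacency followed by an identity skip connection, so local topology is aggregated without discarding the transformer's global features. This is a minimal NumPy sketch under assumed names and shapes, not the paper's actual block.

```python
import numpy as np

def graph_residual_block(X, A, W):
    """Hypothetical graph-residual block: aggregate vertex features over
    the mesh topology (normalized adjacency with self-loops), apply a
    linear map and ReLU, then add a residual connection.

    X: (n, d) vertex features, A: (n, n) mesh adjacency, W: (d, d) weights.
    """
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = (A_hat * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]  # D^-1/2 A D^-1/2
    return X + np.maximum(A_norm @ X @ W, 0.0)          # residual + ReLU
```

The residual term is what lets the block sharpen local contact geometry while leaving the globally attended features intact, which matches the "dual focus" the summary describes.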
Zhenrong Wang
Shenzhen University
Qi Zheng
Shenzhen University
Sihan Ma
University of Sydney
Maosheng Ye
DeepRoute.AI
Yibing Zhan
Unknown affiliation
Dongjiang Li
JD Explore Academy

Deep Learning · Computer Vision · Robotics