ECHO: Ego-Centric modeling of Human-Object interactions

📅 2025-08-29
🤖 AI Summary
This work addresses the underexplored problem of egocentric human-object interaction (HOI) reconstruction from head- and wrist-mounted sensors (e.g., smart glasses and a smartwatch). It proposes ECHO, the first unified framework that jointly estimates 3D human pose, object motion, and contact states solely from head and wrist tracking data. All three variables are represented in a head-centric canonical space, which improves robustness to global orientation. A Diffusion Transformer performs a joint three-variate diffusion over these interdependent variables, and a conveyor-based progressive inference strategy supports sequences of arbitrary length. ECHO outperforms existing methods, setting a new state of the art in egocentric HOI reconstruction.

📝 Abstract
Modeling human-object interactions (HOI) from an egocentric perspective is a largely unexplored yet important problem due to the increasing adoption of wearable devices, such as smart glasses and watches. We investigate how much information about an interaction can be recovered from head and wrist tracking alone. Our answer is ECHO (Ego-Centric modeling of Human-Object interactions), the first unified framework to recover three modalities from such minimal observations: human pose, object motion, and contact. ECHO employs a Diffusion Transformer architecture and a three-variate diffusion process that jointly models human motion, object trajectory, and contact sequence, allowing for flexible input configurations. Our method operates in a head-centric canonical space, enhancing robustness to global orientation. We propose a conveyor-based inference scheme, which progressively increases the diffusion timestep with the frame position, allowing us to process sequences of any length. Through extensive evaluation, we demonstrate that ECHO outperforms existing methods that do not offer the same flexibility, setting a new state of the art in egocentric HOI reconstruction.
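The conveyor-based inference described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that the diffusion noise level grows with frame position. The sketch assumes one noise level per window slot, a hypothetical `denoise_step(x, t_from, t_to)` callable standing in for the trained network, and a simple warm-up phase. The names `conveyor_levels`, `conveyor_rollout`, and `toy_denoise_step` are made up for illustration.

```python
import numpy as np

def conveyor_levels(window):
    """Noise level per window slot: 1 at the front, `window` (pure noise) at the back."""
    return np.arange(1, window + 1)

def conveyor_rollout(denoise_step, total_frames, window, dim, seed=0):
    """Emit `total_frames` clean frames from a sliding window of noisy frames.

    denoise_step(x, t_from, t_to) is a hypothetical stand-in for the model:
    it moves frame i from noise level t_from[i] to t_to[i].
    """
    rng = np.random.default_rng(seed)
    target = conveyor_levels(window)
    x = rng.standard_normal((window, dim))   # every slot starts as pure noise
    levels = np.full(window, window)

    # Warm-up (an assumption): denoise front slots until levels form a staircase.
    while (levels > target).any():
        nxt = np.maximum(levels - 1, target)
        x = denoise_step(x, levels, nxt)
        levels = nxt

    emitted = []
    while len(emitted) < total_frames:
        # One step lowers every slot by one level; the front slot becomes clean.
        x = denoise_step(x, levels, levels - 1)
        emitted.append(x[0])
        # Shift the belt and append a fresh pure-noise frame at the back,
        # which restores the staircase 1..window.
        fresh = rng.standard_normal((1, dim))
        x = np.concatenate([x[1:], fresh], axis=0)
    return np.stack(emitted)

def toy_denoise_step(x, t_from, t_to):
    """Toy denoiser for demonstration only: shrinks each frame toward zero."""
    return x * (t_to / t_from)[:, None]

frames = conveyor_rollout(toy_denoise_step, total_frames=10, window=4, dim=3)
print(frames.shape)
```

Because each step emits exactly one finished frame and admits one new noisy frame, the window size stays fixed while the output length is unbounded, which is the property the abstract attributes to this style of inference.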
Problem

Research questions and friction points this paper is trying to address.

Recovering human-object interactions from head and wrist tracking
Modeling human pose, object motion, and contact simultaneously
Handling egocentric perspectives with minimal observation inputs
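One common way to let a single diffusion model handle several modalities with flexible inputs is to give each modality its own noise level, so that an observed modality can be kept clean while the others are generated. Whether ECHO does exactly this is not stated in the text above, so the sketch below is an assumption. It shows only the standard forward noising step applied independently to human pose, object motion, and contact. The feature sizes, the linear beta schedule, and the names `alpha_bar_schedule` and `noise_modalities` are hypothetical.

```python
import numpy as np

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear beta schedule; index 0 means clean."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def noise_modalities(x0, t, abar, rng):
    """Noise each modality to its own timestep t[name] (0 keeps it untouched)."""
    noisy = {}
    for name, clean in x0.items():
        a = abar[t[name]]
        eps = rng.standard_normal(clean.shape)
        # Standard forward diffusion: sqrt(a) * x0 + sqrt(1 - a) * noise.
        noisy[name] = np.sqrt(a) * clean + np.sqrt(1.0 - a) * eps
    return noisy

rng = np.random.default_rng(0)
abar = alpha_bar_schedule(num_steps=1000)

# Hypothetical per-frame feature sizes for a 16-frame clip.
x0 = {
    "human": np.ones((16, 66)),
    "object": np.ones((16, 9)),
    "contact": np.ones((16, 2)),
}
# Example configuration: contact is treated as observed, the rest fully noised.
t = {"human": 1000, "object": 1000, "contact": 0}
noisy = noise_modalities(x0, t, abar, rng)
```

Under this assumption, choosing which modalities receive timestep 0 at inference time selects the input configuration without retraining.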
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer architecture for unified modeling
Head-centric canonical space enhancing orientation robustness
Conveyor-based inference for variable-length sequence processing
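The robustness claim for the head-centric canonical space can be made concrete with a small sketch. It assumes the canonical frame is the full rigid head pose (rotation and position). The paper may define it differently, for example using only the heading angle. The names `to_head_frame` and `rot_z` are illustrative. Expressing wrist positions relative to the head makes them unchanged under any global rotation or translation of the whole scene.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the vertical (z) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def to_head_frame(points_world, head_R, head_p):
    """Express world points (N, 3) in the head frame.

    head_R is the world-from-head rotation (3, 3) and head_p the head position (3,).
    Row-vector form of R^T (p - head_p).
    """
    return (points_world - head_p) @ head_R

# Example head pose and two wrist positions in world coordinates.
head_R = rot_z(0.3)
head_p = np.array([0.1, -0.2, 1.6])
wrists = np.array([[0.4, 0.1, 1.1], [0.3, -0.5, 1.0]])
canonical = to_head_frame(wrists, head_R, head_p)

# Move the whole scene by a global rotation G and a shift.
G = rot_z(0.7)
shift = np.array([2.0, -1.0, 0.0])
canonical_moved = to_head_frame(wrists @ G.T + shift, G @ head_R, G @ head_p + shift)

print(np.allclose(canonical, canonical_moved))
```

A model trained on such head-relative inputs never sees the global heading, so it cannot overfit to it.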