🤖 AI Summary
This work addresses the underexplored problem of egocentric human-object interaction (HOI) reconstruction from head- and wrist-mounted sensors (e.g., smart glasses and a smartwatch). It proposes the first unified framework that jointly estimates 3D human pose, object motion, and contact states solely from head and wrist tracking data. To model these coherently, the method operates in a head-centered canonical space in which all three variables are represented, which improves robustness to global orientation. A conveyor-belt-inspired progressive inference strategy supports arbitrarily long sequences. For joint generation, a Diffusion Transformer performs synchronized diffusion over the three interdependent variables. In the authors' evaluation, the method outperforms prior approaches and sets a new state of the art in egocentric HOI reconstruction.
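To make the idea of a head-centered canonical space concrete, here is a minimal sketch of one way to express tracked points relative to the head. It assumes the head pose is given as a rotation matrix and a position in world coordinates; the function names `to_head_frame` and `rot_z` are hypothetical, and the summary does not state how the canonical space is actually constructed (it may, for example, use only part of the head rotation).

```python
import numpy as np

def to_head_frame(head_rot, head_pos, points_world):
    """Express world-space points in the head's local frame.

    head_rot: (3, 3) rotation matrix mapping head-local axes to world axes.
    head_pos: (3,) head position in world coordinates.
    points_world: (N, 3) points in world coordinates (e.g. wrist positions).
    """
    # For row vectors, (p - t) @ R equals R^T (p - t).
    return (points_world - head_pos) @ head_rot

def rot_z(angle):
    """Rotation by `angle` radians about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Illustrative values only (not from the paper).
head_rot = rot_z(0.3)
head_pos = np.array([1.0, 2.0, 3.0])
wrists = np.array([[1.2, 2.1, 2.5], [0.8, 2.3, 2.4]])
canonical = to_head_frame(head_rot, head_pos, wrists)

# Apply the same global rotation and shift to the whole scene.
G, shift = rot_z(1.1), np.array([5.0, -2.0, 0.5])
moved = to_head_frame(G @ head_rot, G @ head_pos + shift, wrists @ G.T + shift)
```

Under these assumptions, `canonical` and `moved` agree up to floating-point error: moving the whole scene rigidly does not change the head-relative coordinates, which is the kind of invariance to global orientation that such a representation provides.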
📝 Abstract
Modeling human-object interactions (HOI) from an egocentric perspective is a largely unexplored yet important problem due to the increasing adoption of wearable devices, such as smart glasses and watches. We investigate how much information about interaction can be recovered from only head and wrist tracking. Our answer is ECHO (Ego-Centric modeling of Human-Object interactions), which, for the first time, proposes a unified framework to recover three modalities: human pose, object motion, and contact from such minimal observations. ECHO employs a Diffusion Transformer architecture and a unique three-variate diffusion process, which jointly models human motion, object trajectory, and contact sequence, allowing for flexible input configurations. Our method operates in a head-centric canonical space, enhancing robustness to global orientation. We propose a conveyor-based inference scheme, which progressively increases the diffusion timestep with the frame position, allowing us to process sequences of any length. Through extensive evaluation, we demonstrate that ECHO outperforms existing methods that do not offer the same flexibility, setting a new state of the art in egocentric HOI reconstruction.
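The conveyor-based inference can be pictured with a small bookkeeping sketch that tracks only noise levels, not actual motion data. It assumes one denoising step per iteration and a number of noise levels equal to the window length; both are simplifications, and `simulate_conveyor` is a hypothetical name, not the authors' implementation. Frames enter at the back fully noisy, every frame on the belt is denoised by one step per iteration, and the front frame leaves once it is clean, so the noise level grows with frame position.

```python
from collections import deque

def simulate_conveyor(num_frames, window):
    """Track noise levels only, to show how frames move along the belt.

    Returns the order in which frames are emitted and the peak belt length.
    """
    belt = deque()  # entries are [frame_index, noise_level]
    next_frame = 0
    emitted = []
    peak = 0
    while len(emitted) < num_frames:
        # Load a fresh, fully noisy frame at the back while input remains.
        if next_frame < num_frames:
            belt.append([next_frame, window])
            next_frame += 1
        peak = max(peak, len(belt))
        # One denoising step for every frame currently on the belt
        # (a real model would update all of them in a single forward pass).
        for slot in belt:
            slot[1] -= 1
        # The front frame leaves once its noise level reaches zero.
        if belt and belt[0][1] == 0:
            emitted.append(belt.popleft()[0])
    return emitted, peak

order, peak = simulate_conveyor(num_frames=5, window=3)
```

In this sketch the belt never holds more than `window` frames regardless of `num_frames`, which is why a scheme of this kind can handle sequences of any length with fixed memory.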