End-to-End Spatial-Temporal Transformer for Real-time 4D HOI Reconstruction

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses depth ambiguity and occlusion in 4D human-object interaction (HOI) reconstruction from monocular video, where existing methods suffer from high latency and error accumulation because they rely on multi-stage pipelines or iterative optimization. We propose THO, the first end-to-end framework for real-time 4D HOI reconstruction, built on a spatiotemporal Transformer architecture. THO uses spatial contact-region priors to infer occluded object features and cross-frame kinematic temporal priors to improve physical plausibility and temporal coherence. Requiring only a monocular RGB video and 3D object templates as input, THO runs at 31.5 FPS on a single RTX 4090 GPU, over 600× faster than prior optimization-based approaches, while also improving reconstruction accuracy and stability.

📝 Abstract
Monocular 4D human-object interaction (HOI) reconstruction, which recovers a moving human and a manipulated object from a single RGB video, remains challenging due to depth ambiguity and frequent occlusions. Existing methods often rely on multi-stage pipelines or iterative optimization, which leads to high inference latency, prevents them from meeting real-time requirements, and leaves them susceptible to error accumulation. To address these limitations, we propose THO, an end-to-end Spatial-Temporal Transformer that predicts human motion and coordinated object motion in a feed-forward fashion from the given video and 3D object template. THO achieves this by leveraging spatial-temporal HOI tuple priors. Spatial priors exploit contact-region proximity to infer occluded object features from human cues, while temporal priors capture cross-frame kinematic correlations to refine object representations and enforce physical coherence. Extensive experiments demonstrate that THO runs at 31.5 FPS on a single RTX 4090 GPU, a >600× speedup over prior optimization-based methods, while also improving reconstruction accuracy and temporal consistency. The project page is available at: https://nianheng.github.io/THO-project/
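To put the reported speed in per-frame terms, the short sketch below converts the two figures from the abstract (31.5 FPS and a speedup of more than 600×) into latencies. It assumes the speedup refers to per-frame throughput, which the abstract does not state explicitly; the function and variable names are illustrative and not taken from the paper.

```python
def frame_latency_ms(fps: float) -> float:
    """Average time spent per frame, in milliseconds, at a given throughput."""
    return 1000.0 / fps


# Figures quoted in the abstract.
THO_FPS = 31.5
MIN_SPEEDUP = 600.0  # reported as ">600x", so this is a lower bound

# Assumption: the speedup compares frames processed per second.
tho_latency_ms = frame_latency_ms(THO_FPS)
baseline_fps = THO_FPS / MIN_SPEEDUP  # upper bound on baseline throughput
baseline_latency_s = frame_latency_ms(baseline_fps) / 1000.0  # lower bound

print(f"THO: about {tho_latency_ms:.1f} ms per frame")
print(f"Baseline: at most {baseline_fps:.4f} FPS, "
      f"at least {baseline_latency_s:.1f} s per frame")
```

Under that assumption, THO spends roughly 31.7 ms per frame, while an optimization-based baseline that is 600× slower would need about 19 s per frame.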
Problem

Research questions and friction points this paper is trying to address.

4D human-object interaction
monocular reconstruction
real-time
occlusion
depth ambiguity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-Temporal Transformer
End-to-End HOI Reconstruction
Monocular 4D Reconstruction
Real-time Inference
Contact-aware Priors
Haoyu Zhang
Ph.D. candidate, Norwegian University of Science and Technology
Wei Zhai
University of Science and Technology of China
Yuhang Yang
University of Science and Technology of China
Yang Cao
University of Science and Technology of China
computer vision, image processing
Zheng-Jun Zha
University of Science and Technology of China