End-to-End Spatial-Temporal Transformer for Real-time 4D HOI Reconstruction

📅 2026-03-15

🤖 AI Summary

This work addresses the challenges of depth ambiguity and occlusion in 4D human-object interaction (HOI) reconstruction from monocular video, where existing methods suffer from high latency and error accumulation due to multi-stage pipelines or iterative optimization. We propose THO, the first end-to-end framework for real-time 4D HOI reconstruction, built upon a spatiotemporal Transformer architecture. THO integrates spatial contact region priors to infer occluded object features and incorporates cross-frame kinematic temporal priors to enhance physical plausibility and temporal coherence. Requiring only a monocular RGB video and 3D object templates as input, THO achieves a real-time inference speed of 31.5 FPS on a single RTX 4090 GPU—over 600× faster than prior optimization-based approaches—while simultaneously improving reconstruction accuracy and stability.
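The summary does not spell out how THO's spatiotemporal Transformer is wired, so the following is only a generic sketch of one common pattern: attention over the tokens within each frame, followed by attention over time for each token. The function names, tensor shapes, and the use of unparameterized attention (no learned projections) are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Unparameterized scaled dot-product self-attention (illustrative only).
    # tokens: (..., N, D); attention runs over the N axis.
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def spatial_temporal_block(x):
    # x: (T, N, D) = frames, tokens per frame, channels (hypothetical layout).
    x = x + self_attention(x)        # spatial: mix tokens within each frame
    xt = np.swapaxes(x, 0, 1)        # (N, T, D)
    xt = xt + self_attention(xt)     # temporal: mix each token across frames
    return np.swapaxes(xt, 0, 1)     # back to (T, N, D)

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 5, 16))   # 8 frames, 5 tokens, 16 channels
out = spatial_temporal_block(features)
print(out.shape)
```

In a sketch like this, the spatial step is where human tokens near a contact region could inform occluded object tokens, and the temporal step is where cross-frame motion cues could be shared; how THO actually injects its priors is not described here.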

📝 Abstract

Monocular 4D human-object interaction (HOI) reconstruction, which recovers a moving human and a manipulated object from a single RGB video, remains challenging due to depth ambiguity and frequent occlusions. Existing methods often rely on multi-stage pipelines or iterative optimization, which leads to high inference latency that fails to meet real-time requirements and makes them susceptible to error accumulation. To address these limitations, we propose THO, an end-to-end Spatial-Temporal Transformer that predicts human motion and coordinated object motion in a feed-forward fashion from the given video and 3D object template. THO achieves this by leveraging spatial-temporal HOI tuple priors. Spatial priors exploit contact-region proximity to infer occluded object features from human cues, while temporal priors capture cross-frame kinematic correlations to refine object representations and enforce physical coherence. Extensive experiments demonstrate that THO runs at 31.5 FPS on a single RTX 4090 GPU, a more than 600× speedup over prior optimization-based methods, while simultaneously improving reconstruction accuracy and temporal consistency. The project page is available at: https://nianheng.github.io/THO-project/
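To put the reported speed figures in per-frame terms, the arithmetic can be worked through as below. This is a back-of-the-envelope sketch: it assumes the 600× figure compares per-frame throughput directly, which the abstract does not state explicitly, and the variable names are illustrative. The baseline numbers are implied bounds, not measurements reported in the source.

```python
fps_tho = 31.5        # reported throughput on a single RTX 4090
min_speedup = 600.0   # reported lower bound on the speedup

# Per-frame latency implied by the reported throughput.
latency_ms = 1000.0 / fps_tho

# If THO is more than 600x faster, a baseline runs below this rate
# (assumption: the speedup is a ratio of per-frame throughputs).
baseline_fps_max = fps_tho / min_speedup
baseline_s_per_frame_min = 1.0 / baseline_fps_max

print(f"THO latency: {latency_ms:.1f} ms per frame")
print(f"Implied baseline: under {baseline_fps_max:.4f} FPS, "
      f"over {baseline_s_per_frame_min:.1f} s per frame")
```

Under that assumption, THO spends roughly 31.7 ms per frame, while an optimization-based baseline would need more than about 19 s per frame.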

Problem

Research questions and friction points this paper is trying to address.

4D human-object interaction

monocular reconstruction

real-time

occlusion

depth ambiguity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-Temporal Transformer

End-to-End HOI Reconstruction

Monocular 4D Reconstruction