🤖 AI Summary
This work addresses the challenging problem of monocular video-based 4D human-scene joint reconstruction. We propose the first end-to-end, single-stage, online feedforward framework that simultaneously reconstructs global multi-person SMPL-X body models, dense 3D scene geometry, and camera trajectories. Departing from conventional multi-stage pipelines, which rely on preprocessed human detection, depth estimation, SLAM, and iterative optimization, our method builds upon the CUT3R architecture and introduces a parameter-efficient visual prompt tuning mechanism to jointly decode multi-person SMPL-X parameters, implicit scene representations, and camera poses in a single forward pass. Trained on a single GPU for only one day, the model achieves real-time inference at 15 FPS with just 8 GB of GPU memory. It attains state-of-the-art or competitive performance across multiple benchmarks, marking the first solution to enable "one-shot" online 4D human-scene co-reconstruction.
📝 Abstract
We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies, e.g., human detection, depth estimation, and SLAM pre-processing, Human3R jointly recovers global multi-person SMPL-X bodies ("everyone"), dense 3D scene ("everywhere"), and camera trajectories in a single forward pass ("all-at-once"). Our method builds upon the 4D online reconstruction model CUT3R and uses parameter-efficient visual prompt tuning, striving to preserve CUT3R's rich spatiotemporal priors while enabling direct readout of multiple SMPL-X bodies. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. After being trained on the relatively small-scale synthetic dataset BEDLAM for just one day on one GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, along with 3D scenes, in one stage, at real-time speed (15 FPS) with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. We hope that Human3R will serve as a simple yet strong baseline and be easily extended for downstream applications. Code available at https://fanegg.github.io/Human3R
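To make the "parameter-efficient visual prompt tuning" idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the actual Human3R architecture or its real dimensions): a frozen transformer backbone stands in for the pretrained CUT3R model, learnable prompt tokens are prepended to its input, and a small head reads SMPL-X-style parameters off the prompt positions. Only the prompts and the head are trained, which is the sense in which the tuning is parameter-efficient. All names, sizes, and the output dimension here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PromptTunedReadout(nn.Module):
    """Illustrative prompt-tuning sketch: frozen backbone + learnable
    prompt tokens + a small readout head. Not the real Human3R code."""

    def __init__(self, dim: int = 64, n_prompts: int = 4, n_out: int = 128):
        # n_out stands in for a flattened SMPL-X parameter vector; the
        # actual size depends on the body model configuration.
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the pretrained prior
        self.prompts = nn.Parameter(torch.zeros(1, n_prompts, dim))  # trainable
        self.head = nn.Linear(dim, n_out)  # trainable readout

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) image/scene tokens from the frozen encoder path
        b = tokens.shape[0]
        x = torch.cat([self.prompts.expand(b, -1, -1), tokens], dim=1)
        x = self.backbone(x)
        # Read out parameters from the prompt positions only.
        return self.head(x[:, : self.prompts.shape[1]].mean(dim=1))

model = PromptTunedReadout()
out = model(torch.randn(2, 10, 64))  # (B=2, n_out=128)
```

In this sketch, the trainable parameter count (prompts plus one linear head) is a small fraction of the backbone's, which mirrors why such tuning can be done on one GPU in a day while the frozen backbone's spatiotemporal priors are preserved.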