🤖 AI Summary
This work addresses the challenge of efficiently achieving consistent 3D reconstruction of humans and scenes from multi-person, multi-view videos, a task for which existing methods typically rely on monocular inputs or additional preprocessing modules. We propose CHROMM, a unified framework that jointly estimates camera parameters, scene point clouds, and human meshes in a single forward pass, without external components. A scale-adaptation module mitigates the scale discrepancy between humans and scenes, while a geometry-driven multi-person association mechanism and a test-time multi-view fusion strategy further improve robustness and efficiency. Integrating geometric and human priors from Pi3X and Multi-HMR, CHROMM forms an end-to-end trainable network that achieves competitive performance on EMDB, RICH, EgoHumans, and EgoExo4D while running over eight times faster than prior multi-view methods.
📝 Abstract
Recent advances in 3D foundation models have led to growing interest in reconstructing humans together with their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. To address this, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person, multi-view videos without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network and introduce a scale-adjustment module to resolve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy that aggregates per-view estimates into a single representation at test time. Finally, we propose a geometry-based multi-person association method that is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: https://nstar1125.github.io/chromm.
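The abstract does not specify how the geometry-based association works. As a minimal illustrative sketch (not the authors' implementation), assuming each view yields per-person 3D root-joint positions already expressed in a shared world frame, cross-view identities could be linked by minimum-total-distance matching; the function name, the brute-force matching, and the `max_dist` gating threshold are all hypothetical:

```python
import itertools
import math

def associate_by_geometry(roots_a, roots_b, max_dist=0.5):
    """Hypothetical geometry-based association between two views.

    roots_a, roots_b: lists of (x, y, z) root-joint positions in a
    shared world frame, with len(roots_a) <= len(roots_b).
    Returns (i, j) index pairs matched by minimum total 3D distance,
    keeping only pairs closer than max_dist (meters, assumed unit).
    """
    n = len(roots_a)
    best_perm, best_cost = None, float("inf")
    # Brute-force assignment is fine for small per-frame crowds.
    for perm in itertools.permutations(range(len(roots_b)), n):
        cost = sum(math.dist(roots_a[i], roots_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return [(i, j) for i, j in enumerate(best_perm)
            if math.dist(roots_a[i], roots_b[j]) < max_dist]

# Toy example: two people seen from two views with small localization noise.
view_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
view_b = [(2.05, 0.0, 0.02), (0.03, 0.01, 0.0)]
print(associate_by_geometry(view_a, view_b))  # [(0, 1), (1, 0)]
```

Unlike appearance-based matching, this depends only on 3D geometry, which is why such association can stay robust when people look similar or are seen from very different viewpoints.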