🤖 AI Summary
Existing multi-view pedestrian detection methods rely on costly manual 3D annotations for human modeling, suffer from noisy feature fusion, and generalize poorly across scenes. To address these challenges, this paper proposes the Depth-Consistent Human Modeling (DCHM) framework. DCHM introduces a superpixel-wise Gaussian splatting technique that enables depth estimation and depth-consistent fusion in a global coordinate system under sparse-view, large-scale, crowded conditions, without any 3D supervision, yielding low-noise point-cloud representations of pedestrians. By resolving cross-view conflicts and mitigating depth-estimation errors, DCHM improves pedestrian localization accuracy and multi-view segmentation, especially in crowded scenes. Extensive experiments show that DCHM outperforms state-of-the-art baselines across multiple benchmarks and, to the authors' knowledge, is the first to reconstruct pedestrians and perform multiview segmentation in this challenging setting.
📝 Abstract
Multiview pedestrian detection typically involves two stages: human modeling and pedestrian localization. Human modeling represents pedestrians in 3D space by fusing multiview information, making its quality crucial for detection accuracy. However, existing methods often introduce noise and have low precision. While some approaches reduce noise by fitting to costly multiview 3D annotations, they often struggle to generalize across diverse scenes. To eliminate reliance on human-labeled annotations and accurately model humans, we propose Depth-Consistent Human Modeling (DCHM), a framework designed for consistent depth estimation and multiview fusion in global coordinates. Specifically, our proposed pipeline with superpixel-wise Gaussian Splatting achieves multiview depth consistency in sparse-view, large-scale, and crowded scenarios, producing precise point clouds for pedestrian localization. Extensive validations demonstrate that our method significantly reduces noise during human modeling, outperforming previous state-of-the-art baselines. Additionally, to our knowledge, DCHM is the first to reconstruct pedestrians and perform multiview segmentation in such a challenging setting. Code is available on the [project page](https://jiahao-ma.github.io/DCHM/).
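The fusion step the abstract describes — lifting per-view depth estimates into a shared world frame to form a point cloud — can be sketched with standard pinhole-camera geometry. This is a minimal illustration, not the paper's actual pipeline: the function names, the intrinsics matrix `K`, and the camera-to-world pose convention are assumptions for the example.

```python
import numpy as np

def depth_to_world_points(depth, K, cam_to_world):
    """Back-project a per-view depth map into world-frame 3D points.

    depth        : (H, W) metric depth map
    K            : (3, 3) pinhole intrinsics
    cam_to_world : (4, 4) camera-to-world extrinsics
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))            # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # homogeneous pixels
    rays = np.linalg.inv(K) @ pix.reshape(-1, 3).T            # camera-frame rays, 3xN
    pts_cam = rays * depth.reshape(1, -1)                     # scale each ray by depth
    pts_h = np.vstack([pts_cam, np.ones((1, pts_cam.shape[1]))])
    return (cam_to_world @ pts_h)[:3].T                       # Nx3 world points

def fuse_views(depths, Ks, poses):
    """Concatenate per-view back-projections in one global coordinate frame."""
    return np.concatenate(
        [depth_to_world_points(d, K, T) for d, K, T in zip(depths, Ks, poses)],
        axis=0,
    )
```

In a multiview setup like DCHM's, consistent depth across views means these per-camera clouds agree where they overlap; inconsistent depth shows up as duplicated or smeared geometry in the fused cloud.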