MSMVD: Exploiting Multi-scale Image Features via Multi-scale BEV Features for Multi-view Pedestrian Detection

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multi-view pedestrian detection (MVPD), performance degradation arises from extreme scale variations (i.e., very small or very large pedestrians) and substantial inter-view scale discrepancies. To address this, we propose a multi-scale bird’s-eye view (BEV) feature modeling framework. Our method introduces a scale-aligned view-to-BEV projection mechanism that preserves multi-scale characteristics of each camera view at every scale level, coupled with a cross-view and cross-scale feature pyramid network for deep fusion and aggregation of multi-scale BEV features. This design effectively mitigates the sensitivity of conventional single-scale BEV representations to scale variation. Evaluated on the GMVD dataset, our approach achieves a 4.5-point improvement in MODA over the previous state-of-the-art, demonstrating significantly enhanced robustness and accuracy for detecting pedestrians under extreme and heterogeneous scale conditions across views.
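The view-to-BEV projection mentioned above typically relies on the standard ground-plane homography used throughout multi-view pedestrian detection (a general construction, not MSMVD-specific code): points on the ground plane z = 0 map from world to image coordinates via H = K [r1 r2 t], where K is the camera intrinsics and r1, r2, t come from the extrinsics. The camera parameters below are hypothetical placeholders.

```python
import numpy as np

# Ground-plane homography: for a pinhole camera with intrinsics K and
# extrinsics [R | t], world points (x, y, 0) on the ground plane project
# to the image via H = K [r1 r2 t]. All parameters here are hypothetical.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])   # hypothetical intrinsics
R = np.eye(3)                            # hypothetical rotation
t = np.array([0.0, 0.0, 5.0])            # camera 5 m from the plane

H = K @ np.column_stack([R[:, 0], R[:, 1], t])  # 3x3 ground-plane homography

# Map a ground point (x, y, 0) = (1, 2, 0) into the image:
uvw = H @ np.array([1.0, 2.0, 1.0])
u, v = uvw[:2] / uvw[2]
```

Inverting (or resampling through) this homography per camera is what lets image-plane features be placed onto a common BEV grid.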

📝 Abstract
Multi-View Pedestrian Detection (MVPD) aims to detect pedestrians in the form of a bird's eye view (BEV) from multi-view images. In MVPD, end-to-end trainable deep learning methods have progressed greatly. However, they often struggle to detect pedestrians with consistently small or large scales in views or with vastly different scales between views. This is because they do not exploit multi-scale image features to generate the BEV feature and detect pedestrians. To overcome this problem, we propose a novel MVPD method, called Multi-Scale Multi-View Detection (MSMVD). MSMVD generates multi-scale BEV features by projecting multi-scale image features extracted from individual views into the BEV space, scale-by-scale. Each of these BEV features inherits the properties of its corresponding scale image features from multiple views. Therefore, these BEV features help the precise detection of pedestrians with consistently small or large scales in views. Then, MSMVD combines information at different scales of multiple views by processing the multi-scale BEV features using a feature pyramid network. This improves the detection of pedestrians with vastly different scales between views. Extensive experiments demonstrate that exploiting multi-scale image features via multi-scale BEV features greatly improves the detection performance, and MSMVD outperforms the previous highest MODA by $4.5$ points on the GMVD dataset.
Problem

Research questions and friction points this paper is trying to address.

Detecting pedestrians with consistently small or large scales in views
Addressing vastly different pedestrian scales between multiple views
Exploiting multi-scale image features for BEV feature generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale image features projection
Scale-by-scale BEV feature generation
Feature pyramid network integration
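The three contributions listed above can be sketched roughly as follows. This is an illustrative simplification, not the authors' implementation: the per-scale homographies, nearest-neighbour sampling, view averaging, and additive top-down fusion are stand-ins for MSMVD's learned projection and fusion components.

```python
import numpy as np

def project_to_bev(feat, H_bev2img, bev_size):
    """Inverse-warp a (C, h, w) feature map onto a square BEV grid.
    H_bev2img maps BEV cell coordinates (x, y, 1) to image pixels;
    nearest-neighbour sampling keeps the sketch short."""
    C, h, w = feat.shape
    bev = np.zeros((C, bev_size, bev_size))
    ys, xs = np.meshgrid(np.arange(bev_size), np.arange(bev_size), indexing="ij")
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = H_bev2img @ pts
    u = np.round(src[0] / src[2]).astype(int)
    v = np.round(src[1] / src[2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    bev[:, ys.ravel()[ok], xs.ravel()[ok]] = feat[:, v[ok], u[ok]]
    return bev

def multiscale_bev(view_feats, homs, bev_sizes):
    """view_feats[s][v]: scale-s features of view v. Returns one BEV map
    per scale, averaging the per-view projections scale-by-scale."""
    return [np.mean([project_to_bev(f, H, n)
                     for f, H in zip(view_feats[s], homs[s])], axis=0)
            for s, n in enumerate(bev_sizes)]

def fpn_fuse(bev_pyramid):
    """FPN-style top-down fusion over BEV levels (finest first): upsample
    each coarser level 2x (nearest neighbour) and add it to the finer one."""
    fused = [bev_pyramid[-1]]
    for feat in reversed(bev_pyramid[:-1]):
        up = fused[0].repeat(2, axis=1).repeat(2, axis=2)
        fused.insert(0, feat + up[:, :feat.shape[1], :feat.shape[2]])
    return fused
```

Each fused BEV level then retains the view-specific detail of one image-feature scale, which is the property the paper credits for handling both consistently extreme and cross-view-inconsistent pedestrian scales.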
Taiga Yamane
NTT Human Informatics Laboratories, NTT Corporation, Yokosuka, Japan
Satoshi Suzuki
NTT
Neural networks, computer vision, deep learning, video coding for machines
Ryo Masumura
Distinguished Research Scientist, NTT Corporation
Speech Recognition, Spoken Language Processing, Natural Language Processing, Computer Vision
Shota Orihashi
NTT Human Informatics Laboratories, NTT Corporation, Yokosuka, Japan
Tomohiro Tanaka
NTT Human Informatics Laboratories, NTT Corporation, Yokosuka, Japan
Mana Ihori
NTT Human Informatics Laboratories, NTT Corporation, Yokosuka, Japan
Naoki Makishima
NTT Human Informatics Laboratories, NTT Corporation, Yokosuka, Japan
Naotaka Kawata
NTT Human Informatics Laboratories, NTT Corporation, Yokosuka, Japan