Leveraging Audio Representations for Vibration-Based Crowd Monitoring in Stadiums

📅 2025-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the privacy concerns and disturbance caused by camera- and microphone-based crowd monitoring in sports venues, this work proposes a non-intrusive behavioral prediction approach based on floor-vibration sensing. The core challenge is the extreme scarcity of labeled vibration data. To overcome this, the authors introduce ViLA, a cross-modal pretraining framework for vibration analysis: audio representations, pretrained in an unsupervised manner on public YouTube-8M data, are transferred to the vibration domain, enabling semantic modeling of vibration signals. A vibration model initialized with these representations is then fine-tuned with a minimal amount of in-situ labeled data. Real-world experiments demonstrate up to a 5.8× reduction in prediction error compared to a model without audio pretraining, substantially easing data dependency and enabling low-interference, privacy-preserving sensing of public spaces without cameras or microphones.

📝 Abstract
Crowd monitoring in sports stadiums is important to enhance public safety and improve the audience experience. Existing approaches mainly rely on cameras and microphones, which can cause significant disturbances and often raise privacy concerns. In this paper, we sense floor vibration, which provides a less disruptive and more non-intrusive way of crowd sensing, to predict crowd behavior. However, since the vibration-based crowd monitoring approach is newly developed, one main challenge is the lack of training data due to sports stadiums being large public spaces with complex physical activities. In this paper, we present ViLA (Vibration Leverage Audio), a vibration-based method that reduces the dependency on labeled data by pre-training with unlabeled cross-modality data. ViLA is first pre-trained on audio data in an unsupervised manner and then fine-tuned with a minimal amount of in-domain vibration data. By leveraging publicly available audio datasets, ViLA learns the wave behaviors from audio and then adapts the representation to vibration, reducing the reliance on domain-specific vibration data. Our real-world experiments demonstrate that pre-training the vibration model using publicly available audio data (YouTube8M) achieved up to a 5.8x error reduction compared to the model without audio pre-training.
Problem

Research questions and friction points this paper is trying to address.

Develops vibration-based crowd monitoring for stadiums
Reduces need for labeled data via audio pre-training
Improves accuracy by leveraging cross-modality wave behaviors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vibration sensing for crowd monitoring
Pre-trains model with unlabeled audio data
Reduces need for domain-specific vibration data
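The transfer idea above rests on audio and floor vibration being representable with the same time-frequency front end, so an encoder pretrained on audio spectrograms can be reused on vibration input. A minimal NumPy sketch of that shared front end (illustrative only; the paper's actual architecture and feature pipeline are not reproduced here):

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    """Frame a 1-D signal and take the log-magnitude FFT of each frame.

    Hypothetical shared front end: because audio and vibration are both
    wave signals, one spectrogram representation can feed a single
    encoder across the two modalities.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(mag)  # shape: (n_frames, frame_len // 2 + 1)

# Two different wave sources, identical feature geometry: an encoder
# pretrained on the first can be fine-tuned on the second.
audio = np.random.randn(16000)      # stand-in for an audio clip
vibration = np.random.randn(16000)  # stand-in for a floor-vibration trace
assert log_spectrogram(audio).shape == log_spectrogram(vibration).shape
```

The identical output shape is the point: the audio-pretrained weights slot directly onto vibration features, and only the small labeled vibration set is needed for fine-tuning.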
Yen Cheng Chang
University of Michigan, 1301 Beal Ave, Ann Arbor, Michigan 48105, USA
Jesse Codling
University of Michigan, 1301 Beal Ave, Ann Arbor, Michigan 48105, USA
Yiwen Dong
Stanford University, Stanford, California, USA
Jiale Zhang
University of Michigan, 1301 Beal Ave, Ann Arbor, Michigan 48105, USA
Jiasi Chen
University of Michigan, Ann Arbor
mobile systems, mixed reality, computer networks, machine learning
Hae Young Noh
Associate Professor of Civil and Environmental Engineering, Stanford University
Structures as Sensors, Structural Health Monitoring, Physics-Informed Learning, Smart Cities
Pei Zhang
University of Michigan, 1301 Beal Ave, Ann Arbor, Michigan 48105, USA