A Transformer-based Multimodal Fusion Model for Efficient Crowd Counting Using Visual and Wireless Signals

📅 2025-04-28

📈 Citations: 0

✨ Influential: 0

career value

175K/year

🤖 AI Summary

To address the accuracy limitations of unimodal (image- or CSI-only) crowd counting methods arising from insufficient information, this paper proposes the first end-to-end CSI-vision bimodal crowd counting framework. Methodologically, we design a hybrid Transformer-CNN architecture: a Transformer encoder captures global contextual dependencies, while CNNs extract fine-grained local visual features; additionally, we introduce cross-modal feature alignment and adaptive weighted fusion to enable deep synergy between CSI phase/amplitude representations and image semantics. To our knowledge, this is the first work to systematically integrate wireless Channel State Information (CSI) into crowd counting. Extensive experiments across diverse real-world scenarios demonstrate that our approach reduces the mean absolute error by 32.7% compared to unimodal baselines and state-of-the-art fusion methods, while achieving an inference speed of 28 FPS.

Technology Category

Application Category

📝 Abstract

Current crowd-counting models often rely on single-modal inputs, such as visual images or wireless signal data, which can result in significant information loss and suboptimal recognition performance. To address these shortcomings, we propose TransFusion, a novel multimodal fusion-based crowd-counting model that integrates Channel State Information (CSI) with image data. By leveraging the powerful capabilities of Transformer networks, TransFusion effectively combines these two distinct data modalities, enabling the capture of comprehensive global contextual information that is critical for accurate crowd estimation. However, while transformers are well capable of capturing global features, they potentially fail to identify finer-grained, local details essential for precise crowd counting. To mitigate this, we incorporate Convolutional Neural Networks (CNNs) into the model architecture, enhancing its ability to extract detailed local features that complement the global context provided by the Transformer. Extensive experimental evaluations demonstrate that TransFusion achieves high accuracy with minimal counting errors while maintaining superior efficiency.

Problem

Research questions and friction points this paper is trying to address.

Integrate visual and wireless signals for accurate crowd counting

Overcome single-modal limitations causing information loss

Combine Transformer and CNN for global-local feature fusion

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based multimodal fusion for crowd counting

Integrates CSI and image data via Transformers

Combines CNNs with Transformers for local details

🔎 Similar Papers

Information Fusion in Multimodal IoT Systems for physical activity level monitoring