MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion

📅 2025-04-03

🤖 AI Summary

To address challenges in real-world settings, including wide-area environments, asynchronous multimodal data, and the absence of frame-level annotations, this paper introduces MultiSensor-Home, the first large-scale, multimodal, multi-view action recognition benchmark tailored for home scenarios. It comprises untrimmed high-resolution RGB and audio streams, along with fine-grained frame-level multi-view annotations. Methodologically, the paper proposes a wide-area asynchronous multimodal modeling framework featuring a Transformer-driven dynamic cross-view fusion mechanism and a spatial feature enhancement module guided by an external human detector. Experiments show that the approach outperforms state-of-the-art methods on both MultiSensor-Home and MM-Office, which supports its robustness to asynchrony, inter-view discrepancies, and environmental variations, as well as its generalization across diverse indoor settings.

📝 Abstract

Multi-modal multi-view action recognition is a rapidly growing field in computer vision, offering significant potential for applications in surveillance. However, current datasets often fail to address real-world challenges such as wide-area environmental conditions, asynchronous data streams, and the lack of frame-level annotations. Furthermore, existing methods face difficulties in effectively modeling inter-view relationships and enhancing spatial feature learning. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method and introduce the MultiSensor-Home dataset, a novel benchmark designed for comprehensive action recognition in home environments. The MultiSensor-Home dataset features untrimmed videos captured by distributed sensors, providing high-resolution RGB and audio data along with detailed multi-view frame-level action labels. The proposed MultiTSF method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships. The method also integrates an external human detection module to enhance spatial feature learning. Experiments on the MultiSensor-Home and MM-Office datasets demonstrate the superiority of MultiTSF over state-of-the-art methods. The quantitative and qualitative results highlight the effectiveness of the proposed method in advancing real-world multi-modal multi-view action recognition.
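
To make the idea of attention-based fusion across views concrete, here is a minimal sketch in plain Python. It is not the MultiTSF implementation: the abstract does not specify the architecture, so this assumes one feature vector per camera view, a single attention head with no learned projections, and mean pooling over the attended views. The names `fuse_views`, `softmax`, and `dot` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_views(view_features):
    """Scaled dot-product self-attention across views, then mean pooling.

    view_features: list of V feature vectors (one per view), each of length D.
    Returns (fused, attn): a length-D fused vector and a V x V weight matrix.
    Assumption: queries, keys, and values are the raw features themselves;
    a real Transformer layer would apply learned projections first.
    """
    dim = len(view_features[0])
    scale = math.sqrt(dim)
    attn, attended = [], []
    for query in view_features:
        # How strongly this view attends to every view (including itself).
        weights = softmax([dot(query, key) / scale for key in view_features])
        attn.append(weights)
        # Weighted sum of all view features for this query view.
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, view_features))
             for i in range(dim)]
        )
    n_views = len(view_features)
    fused = [sum(row[i] for row in attended) / n_views for i in range(dim)]
    return fused, attn


if __name__ == "__main__":
    # Two views that agree and one that differs.
    fused, attn = fuse_views([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    print(fused)
    print(attn[0])
```

In this toy input the first view places more weight on the second view, which has identical features, than on the third. That is the sense in which attention weights adapt to the content of each view rather than being fixed in advance.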

Problem

Research questions and friction points this paper is trying to address.

Addressing real-world challenges in multi-modal multi-view action recognition

Modeling inter-view relationships effectively with Transformer-based fusion

Enhancing spatial feature learning for comprehensive action recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based fusion for inter-view relationships

External human detection for spatial features

Multi-modal dataset with frame-level labels
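
One plausible way an external human detector could guide spatial features is sketched below. The summary does not say how the detections are used, so this is an assumption for illustration only: bounding boxes from a detector are turned into a grid mask that keeps features inside person regions and down-weights the rest. The functions `box_mask` and `enhance`, the normalized box format `(x0, y0, x1, y1)`, and the `background` weight are all hypothetical.

```python
def box_mask(height, width, boxes, background=0.2):
    """Build a height x width weight mask from normalized boxes.

    boxes: list of (x0, y0, x1, y1) with coordinates in [0, 1].
    Cells whose centre falls inside any box get weight 1.0;
    all other cells get the `background` weight.
    """
    mask = [[background] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for r in range(height):
            for c in range(width):
                # Centre of the cell in normalized image coordinates.
                cx = (c + 0.5) / width
                cy = (r + 0.5) / height
                if x0 <= cx <= x1 and y0 <= cy <= y1:
                    mask[r][c] = 1.0
    return mask


def enhance(feature_map, boxes, background=0.2):
    """Scale each spatial cell of a 2D feature map by the detection mask."""
    height, width = len(feature_map), len(feature_map[0])
    mask = box_mask(height, width, boxes, background)
    return [
        [feature_map[r][c] * mask[r][c] for c in range(width)]
        for r in range(height)
    ]


if __name__ == "__main__":
    fmap = [[1.0] * 4 for _ in range(4)]
    # A single detected person occupying the top-left quadrant.
    for row in enhance(fmap, [(0.0, 0.0, 0.5, 0.5)]):
        print(row)
```

A soft background weight rather than zero keeps some scene context, which may matter for actions that involve objects outside the person box. Whether the paper makes the same choice is not stated here.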

🔎 Similar Papers

C3T: Cross-modal Transfer Through Time for Human Action Recognition