MultiTSF: Transformer-based Sensor Fusion for Human-Centric Multi-view and Multi-modal Action Recognition

📅 2025-04-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Real-world multi-view, multimodal human action recognition faces challenges including environmental interference, sensor asynchrony, and reliance on fine-grained annotations. To address these, this paper proposes a dynamic cross-view Transformer fusion framework. Methodologically, it introduces a cross-view attention mechanism for adaptive inter-view feature alignment; incorporates a human-detection-driven pseudo-label generation module to enhance frame-level spatial focus and weakly supervised feature learning; and designs a multimodal temporal modeling strategy with explicit asynchronous sensor alignment. Evaluated on the MultiSensor-Home and MM-Office benchmarks, the framework outperforms state-of-the-art methods in both video-level and frame-level accuracy. The results indicate robustness and generalization under realistic deployment conditions, particularly in the presence of sensor heterogeneity, timing misalignment, and limited annotation supervision.

📝 Abstract

Action recognition from multi-modal and multi-view observations holds significant potential for applications in surveillance, robotics, and smart environments. However, existing methods often fall short of addressing real-world challenges such as diverse environmental conditions, strict sensor synchronization, and the need for fine-grained annotations. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF). The proposed method leverages a Transformer-based architecture to dynamically model inter-view relationships and capture temporal dependencies across multiple views. Additionally, we introduce a Human Detection Module to generate pseudo-ground-truth labels, enabling the model to prioritize frames containing human activity and enhance spatial feature learning. Comprehensive experiments conducted on our in-house MultiSensor-Home dataset and the existing MM-Office dataset demonstrate that MultiTSF outperforms state-of-the-art methods in both video sequence-level and frame-level action recognition settings.
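To make the idea of dynamically modeling inter-view relationships more concrete, here is a minimal sketch of attention across views. It is not the paper's implementation: it assumes one feature vector per camera view and uses the raw features as queries, keys, and values, whereas a real Transformer layer would add learned projections, multiple heads, and a temporal dimension. The names `cross_view_attention` and `views` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_view_attention(view_features):
    """Fuse one feature vector per view with scaled dot-product attention.

    Assumption: queries, keys, and values are the raw features themselves
    (no learned projections), purely for illustration.
    """
    dim = len(view_features[0])
    scale = math.sqrt(dim)
    fused = []
    for query in view_features:
        # Similarity of this view to every view, including itself.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in view_features
        ]
        weights = softmax(scores)
        # Each fused vector is a weighted average of all view vectors.
        fused.append([
            sum(w * value[d] for w, value in zip(weights, view_features))
            for d in range(dim)
        ])
    return fused


# Three hypothetical views, each summarized by a 2-dimensional feature.
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_views = cross_view_attention(views)
```

Because the attention weights are recomputed from the features themselves, a view that is occluded or uninformative for a given clip can receive less weight without any fixed, hand-tuned view ordering.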

Problem

Research questions and friction points this paper is trying to address.

Addressing multi-modal multi-view action recognition challenges

Handling diverse environmental conditions and relaxing strict sensor synchronization requirements

Reducing reliance on fine-grained annotations for training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based dynamic inter-view relationship modeling

Human Detection Module for pseudo-ground-truth labels

Enhanced spatial feature learning via prioritized frames
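The pseudo-label idea in the points above can be sketched as follows. This is an illustrative assumption, not the paper's Human Detection Module: it supposes an off-the-shelf person detector has already produced a list of confidence scores per frame, thresholds them into binary pseudo-labels, and then averages frame-level action scores over the flagged frames only. The function names, the 0.5 threshold, and the sample numbers are all hypothetical.

```python
def human_pseudo_labels(detection_scores, threshold=0.5):
    """Return 1 for frames whose best person-detection score meets the threshold.

    detection_scores holds one list of confidences per frame; an empty
    list means the (assumed) detector found no person in that frame.
    """
    return [
        1 if max(scores, default=0.0) >= threshold else 0
        for scores in detection_scores
    ]


def prioritized_mean(frame_values, pseudo_labels):
    """Average per-frame values over frames flagged as containing a person.

    Falls back to all frames when no frame is flagged, so the result is
    always defined.
    """
    selected = [v for v, keep in zip(frame_values, pseudo_labels) if keep]
    if not selected:
        selected = list(frame_values)
    return sum(selected) / len(selected)


# Hypothetical detector output for four frames.
detections = [[0.9, 0.2], [], [0.3], [0.7]]
labels = human_pseudo_labels(detections)

# Hypothetical frame-level scores for one action class.
frame_action_scores = [0.8, 0.1, 0.2, 0.6]
video_score = prioritized_mean(frame_action_scores, labels)
```

Here only the first and last frames pass the threshold, so the video-level score is driven by frames that contain a person rather than by empty background frames.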

🔎 Similar Papers

C3T: Cross-modal Transfer Through Time for Human Action Recognition