MVAFormer: RGB-Based Multi-View Spatio-Temporal Action Recognition with Transformer

📅 2024-10-27
🏛️ International Conference on Information Photonics
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses multi-view spatio-temporal action recognition (STAR): spatially and temporally localizing and classifying the actions of multiple persons in multi-camera video while mitigating occlusion. To this end, the authors propose MVAFormer, an architecture built around a feature-map-level, Transformer-based multi-view cooperation module that separates same-view self-attention from cross-view attention, modeling dependencies across views while preserving spatial information. To the authors' knowledge, MVAFormer is the first framework to bring multi-view cooperation to the STAR setting, training end-to-end from RGB inputs and maintaining spatial correspondence during cross-view feature fusion. Experiments on a newly collected dataset show that MVAFormer outperforms strong baselines by approximately 4.4 points on the F-measure.

πŸ“ Abstract
Multi-view action recognition aims to recognize human actions using multiple camera views and deals with occlusion caused by obstacles or crowds. In this task, cooperation among views, which generates a joint representation by combining multiple views, is vital. Previous studies have explored promising cooperation methods for improving performance. However, since their methods focus only on the task setting of recognizing a single action from an entire video, they are not applicable to the recently popular spatio-temporal action recognition~(STAR) setting, in which each person's action is recognized sequentially. To address this problem, this paper proposes a multi-view action recognition method for the STAR setting, called MVAFormer. In MVAFormer, we introduce a novel transformer-based cooperation module among views. In contrast to previous studies, which utilize embedding vectors with lost spatial information, our module utilizes the feature map for effective cooperation in the STAR setting, which preserves the spatial information. Furthermore, in our module, we divide the self-attention for the same and different views to model the relationship between multiple views effectively. The results of experiments using a newly collected dataset demonstrate that MVAFormer outperforms the comparison baselines by approximately $4.4$ points on the F-measure.
Problem

Research questions and friction points this paper is trying to address.

Multi-view spatio-temporal action recognition must cope with occlusion caused by obstacles or crowds, which makes cooperation among camera views essential
Prior cooperation methods recognize only a single action per video and rely on embedding vectors that discard spatial information, so they do not transfer to the STAR setting
STAR requires recognizing each person's actions sequentially across multiple camera views, which calls for feature-map-level cooperation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based cooperation module that fuses multiple camera views
Feature-map-level fusion that preserves spatial information
Self-attention divided between same-view and different-view tokens to model cross-view relationships
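The divided self-attention idea can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `divided_view_attention`, the single-head unprojected attention, and the simple additive fusion of the two attention terms are all hypothetical choices for clarity. Each spatial position of each view's feature map becomes a token (so spatial information is preserved), and each query attends separately to tokens from its own view (same-view self-attention) and to tokens from all other views (different-view attention).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def divided_view_attention(feature_maps):
    """Toy sketch of attention divided into same-view and cross-view terms.

    feature_maps: array of shape (V, H, W, C), one feature map per view.
    Returns fused tokens of shape (V, H*W, C).
    """
    V, H, W, C = feature_maps.shape
    # Flatten each feature map's spatial grid into tokens, keeping one
    # token per spatial position instead of a single pooled embedding.
    tokens = feature_maps.reshape(V, H * W, C)
    out = np.empty_like(tokens)
    scale = np.sqrt(C)
    for v in range(V):
        q = tokens[v]                                        # queries: (N, C)
        # Same-view branch: attend only to tokens of view v.
        intra = softmax(q @ tokens[v].T / scale) @ tokens[v]
        # Different-view branch: attend to tokens of all other views.
        others = tokens[np.arange(V) != v].reshape(-1, C)
        inter = softmax(q @ others.T / scale) @ others
        # Hypothetical fusion: sum the two branches.
        out[v] = intra + inter
    return out
```

In a real model the queries, keys, and values would be learned linear projections and the two branches would typically be combined by a learned layer rather than a plain sum; the point here is only the split of the attention targets into same-view and different-view token sets.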