Pixels or Positions? Benchmarking Modalities in Group Activity Recognition

📅 2025-11-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing group activity recognition (GAR) research lacks a unified benchmark for fair comparison between the video-pixel and individual-trajectory modalities. Method: We introduce SoccerNet-GAR—the first synchronized multimodal football dataset, with 94,285 annotated instances—and propose a standardized evaluation protocol. We design a role-aware graph neural network that explicitly models tactical structure and spatiotemporal interactions, augmented with a temporal attention mechanism. Contribution/Results: Experiments show the trajectory-based model achieves 67.2% balanced accuracy—significantly outperforming the best video-based model (58.1%)—while training 4.25× faster and requiring only 197K parameters (1/438 of the video model's). Our core contributions are: (1) establishing the first cross-modal GAR benchmark; (2) proposing a structure-aware graph architecture; and (3) systematically demonstrating the trajectory modality's advantages in accuracy, training speed, and parameter efficiency.


📝 Abstract
Group Activity Recognition (GAR) is well studied on the video modality for surveillance and indoor team sports (e.g., volleyball, basketball). Yet, other modalities such as agent positions and trajectories over time, i.e., tracking, remain comparatively under-explored despite being compact, agent-centric signals that explicitly encode spatial interactions. Understanding whether the pixel (video) or position (tracking) modality leads to better group activity recognition is therefore important to drive further research on the topic. However, no standardized benchmark currently exists that aligns broadcast video and tracking data for the same group activities, leading to a lack of apples-to-apples comparison between these modalities for GAR. In this work, we introduce SoccerNet-GAR, a multimodal dataset built from the $64$ matches of the football World Cup 2022. Specifically, the broadcast video and player tracking modalities for $94{,}285$ group activities are synchronized and annotated with $10$ categories. Furthermore, we define a unified evaluation protocol to benchmark two strong unimodal approaches: (i) a competitive video-based classifier and (ii) a tracking-based classifier leveraging graph neural networks. In particular, our novel role-aware graph architecture for tracking-based GAR directly encodes tactical structure through positional edges and temporal attention. Our tracking model achieves $67.2\%$ balanced accuracy compared to $58.1\%$ for the best video baseline, while training $4.25\times$ faster with $438\times$ fewer parameters ($197$K vs $86.3$M). This study provides new insights into the relative strengths of pixels and positions for group activity recognition. Overall, it highlights the importance of modality choice and role-aware modeling for GAR.
Problem

Research questions and friction points this paper is trying to address.

Comparing video pixels versus player tracking for group activity recognition
Creating standardized multimodal benchmark for fair modality comparison
Developing role-aware graph models for tracking-based activity classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced multimodal dataset with synchronized video and tracking data
Developed role-aware graph architecture with positional edges
Implemented temporal attention mechanism for tracking-based classification
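The innovations above—a graph over players with positional edges, message passing per frame, and attention-weighted pooling over time—can be sketched in a minimal numpy toy. All shapes, thresholds, and function names here are illustrative assumptions, not the paper's actual architecture or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T frames, N players, D-dim features per player.
T, N, D = 16, 22, 8
positions = rng.uniform(0, 100, size=(T, N, 2))   # (x, y) on the pitch
features = rng.normal(size=(T, N, D))             # per-player input features

def positional_adjacency(pos, radius=15.0):
    """Connect players closer than `radius` (positional edges)."""
    dists = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    adj = (dists < radius).astype(float)
    np.fill_diagonal(adj, 0.0)                    # no self-loops
    # Row-normalize so each player averages over its neighbors.
    deg = adj.sum(axis=1, keepdims=True)
    return np.divide(adj, deg, out=np.zeros_like(adj), where=deg > 0)

def graph_layer(x, adj, w):
    """One message-passing step: aggregate neighbors, then project."""
    return np.tanh((adj @ x) @ w)

def temporal_attention(frame_emb, q):
    """Softmax-weighted pooling of per-frame embeddings over time."""
    scores = frame_emb @ q
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ frame_emb

w = rng.normal(size=(D, D)) * 0.1                 # toy GNN weights
q = rng.normal(size=(D,))                         # toy attention query

# Per frame: build the positional graph, run one GNN step, mean-pool players.
frame_emb = np.stack([
    graph_layer(features[t], positional_adjacency(positions[t]), w).mean(axis=0)
    for t in range(T)
])
clip_emb = temporal_attention(frame_emb, q)       # one vector per clip
print(clip_emb.shape)                             # (8,)
```

The clip embedding would then feed a linear classifier over the 10 activity categories; role awareness (e.g., per-role embeddings added to player features) is omitted here for brevity.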