🤖 AI Summary
This work addresses the challenge of representation learning in scenarios lacking annotated group activity labels by proposing a novel self-supervised pretraining approach. It introduces, for the first time, person flow estimation and group-relevant object localization as pretext tasks, integrating them within the DINO framework to jointly leverage local motion cues and global scene context. By co-optimizing local and global features, the method effectively learns semantic representations of group activities. Extensive experiments demonstrate significant performance gains in group activity retrieval and recognition across multiple public benchmarks, achieving state-of-the-art results. Ablation studies further confirm the contribution of each proposed component.
📝 Abstract
This paper proposes Group Activity Feature (GAF) learning without group activity annotations. Unlike prior work, which uses low-level static local features to learn GAFs, we propose leveraging dynamics-aware and group-aware pretext tasks, along with local and global features provided by DINO, for group-dynamics-aware GAF learning. To adapt DINO and GAF learning to local dynamics and global group features, our pretext tasks use person flow estimation and group-relevant object location estimation, respectively. Person flow estimation is used to represent the local motion of each person, which is an important cue for understanding group activities. In contrast, group-relevant object location estimation encourages GAFs to learn scene context (e.g., spatial relations of people and objects) as global features. Comprehensive experiments on public datasets demonstrate the state-of-the-art performance of our method in group activity retrieval and recognition. Our ablation studies verify the effectiveness of each component in our method. Code: https://github.com/tezuka0001/Group-DINOmics.
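The abstract describes a training objective that co-optimizes a DINO-style consistency loss with two pretext regression losses (person flow and group-relevant object location). A minimal NumPy sketch of such a combined objective is given below; the function names, loss weights, temperatures, and toy tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax along the last axis
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def dino_consistency_loss(student_logits, teacher_logits, temp_s=0.1, temp_t=0.04):
    # Cross-entropy between the teacher's (sharpened) and student's
    # prototype distributions, as in DINO-style self-distillation
    teacher_p = np.exp(log_softmax(teacher_logits / temp_t))
    return float(-(teacher_p * log_softmax(student_logits / temp_s)).sum(axis=-1).mean())

def person_flow_loss(pred_flow, target_flow):
    # L2 regression on per-person flow vectors (dynamics-aware pretext task)
    return float(np.mean((pred_flow - target_flow) ** 2))

def object_location_loss(pred_xy, target_xy):
    # L2 regression on group-relevant object coordinates (group-aware pretext task)
    return float(np.mean((pred_xy - target_xy) ** 2))

def total_loss(student_logits, teacher_logits, pred_flow, target_flow,
               pred_xy, target_xy, lam_flow=1.0, lam_obj=1.0):
    # Joint objective: global consistency plus the two weighted pretext losses
    return (dino_consistency_loss(student_logits, teacher_logits)
            + lam_flow * person_flow_loss(pred_flow, target_flow)
            + lam_obj * object_location_loss(pred_xy, target_xy))

# Toy example with random "predictions" and "targets"
rng = np.random.default_rng(0)
loss = total_loss(
    student_logits=rng.normal(size=(4, 8)),   # 4 crops, 8 prototypes
    teacher_logits=rng.normal(size=(4, 8)),
    pred_flow=rng.normal(size=(4, 5, 2)),     # 4 frames, 5 people, (dx, dy)
    target_flow=rng.normal(size=(4, 5, 2)),
    pred_xy=rng.normal(size=(4, 2)),          # one object location per frame
    target_xy=rng.normal(size=(4, 2)),
)
print(loss)
```

In the actual method the two regression heads would act on local (per-person) and global (scene-level) DINO features respectively; here all inputs are random arrays only to make the objective's structure concrete.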