🤖 AI Summary
Group Activity Detection (GAD) faces a fundamental challenge: Vision Foundation Models (VFMs) lack inherent capacity to model group dynamics because their object-centric pretraining does not capture social interactions. To address this, we propose a group-aware, prompt-guided inference framework. Our method introduces (1) a learnable group prompt mechanism that explicitly steers the attention of VFMs such as DINOv2 toward social relational cues, and (2) a lightweight GroupContext Transformer that jointly models actor-group associations and infers collective activities. With only 10M trainable parameters, our approach surpasses the state of the art on both the Cafe and Social-CAD benchmarks, with the largest gains in the multi-group Cafe setting: +6.5% Group mAP@1.0 and +8.2% Group mAP@0.5. Moreover, it produces interpretable attention maps, offering an efficient and transparent paradigm for adapting VFMs to GAD.
📝 Abstract
Group Activity Detection (GAD) involves recognizing social groups and their collective behaviors in videos. Vision Foundation Models (VFMs), such as DINOv2, offer strong features but are pretrained primarily on object-centric data and remain underexplored for modeling group dynamics. While they are a promising alternative to highly task-specific GAD architectures that require full fine-tuning, our initial investigation reveals that simply swapping the CNN backbones used in these methods with VFMs brings little gain, underscoring the need for structured, group-aware reasoning on top.
We introduce Prompt-driven Group Activity Detection (ProGraD) -- a method that bridges this gap through 1) learnable group prompts that guide the VFM attention toward social configurations, and 2) a lightweight two-layer GroupContext Transformer that infers actor-group associations and collective behavior. We evaluate our approach on two recent GAD benchmarks: Cafe, which features multiple concurrent social groups, and Social-CAD, which focuses on single-group interactions. While we surpass the state of the art in both settings, our method is especially effective in complex multi-group scenarios, where we achieve gains of 6.5% (Group mAP@1.0) and 8.2% (Group mAP@0.5) using only 10M trainable parameters. Furthermore, our experiments show that ProGraD produces interpretable attention maps, offering insights into actor-group reasoning. Code and models will be released.
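The core idea of prepending learnable group prompts to frozen VFM tokens, so each prompt can attend over the scene and yield an interpretable attention map, can be sketched as below. This is a minimal illustration, not the paper's implementation: the dimensions, prompt count, random (untrained) projections, and all names are illustrative assumptions.

```python
# Minimal sketch of prompt-guided attention over frozen VFM patch tokens.
# All sizes and names are illustrative assumptions, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    # Single-head self-attention with random (untrained) projections,
    # standing in for one layer of a GroupContext-style transformer.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v, attn

d = 64                                          # frozen VFM feature dim
patch_tokens = rng.standard_normal((196, d))    # e.g. 14x14 DINOv2 patch grid
group_prompts = rng.standard_normal((4, d))     # learnable group prompts

# Prepend the prompts so they can attend to (and summarize) the scene.
tokens = np.concatenate([group_prompts, patch_tokens], axis=0)
out, attn = self_attention(tokens, d)

# Each prompt's attention row over the patch tokens is a spatial map
# that can be reshaped to 14x14 and visualized for interpretability.
prompt_maps = attn[:4, 4:]                      # (num_prompts, num_patches)
print(out.shape, prompt_maps.shape)             # (200, 64) (4, 196)
```

In training, only the prompts and the small transformer head would receive gradients while the VFM backbone stays frozen, which is consistent with the 10M-trainable-parameter budget the paper reports.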