Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model

📅 2025-09-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing group activity detection methods rely on implicit visual modeling, lacking contextual reasoning and interpretability. To address this, we propose a language-instruction-guided multimodal large language model framework that enables fine-grained group activity detection through vision–language collaborative reasoning. Our key contributions are: (1) introducing activity-level <ACT> and group-level <GROUP> semantic tokens to integrate pretrained commonsense knowledge; (2) designing a multi-label classification loss to support fine-grained behavioral modeling; and (3) proposing a Multimodal Dual-Alignment Fusion (MDAF) module for cross-modal feature alignment and complementary enhancement. Extensive experiments on multiple benchmark datasets demonstrate significant improvements over state-of-the-art methods. Both quantitative evaluations and qualitative analyses confirm the framework’s superior semantic understanding, interpretable reasoning capability, and robustness in complex, real-world scenarios.

📝 Abstract
Group activity detection (GAD) aims to simultaneously identify group members and categorize their collective activities within video sequences. Existing deep learning-based methods develop specialized architectures (e.g., transformer networks) to model the dynamics of individual roles and semantic dependencies between individuals and groups. However, they rely solely on implicit pattern recognition from visual features and struggle with contextual reasoning and explainability. In this work, we propose LIR-GAD, a novel framework of language-instructed reasoning for GAD via a Multimodal Large Language Model (MLLM). Our approach expands the original vocabulary of the MLLM by introducing an activity-level <ACT> token and multiple cluster-specific <GROUP> tokens. We process video frames alongside these two types of specially designed tokens and language instructions, which are then integrated into the MLLM. The pretrained commonsense knowledge embedded in the MLLM enables the <ACT> token and <GROUP> tokens to effectively capture the semantic information of collective activities and learn distinct representational features of different groups, respectively. We also introduce a multi-label classification loss to further enhance the <ACT> token's ability to learn discriminative semantic representations. Then, we design a Multimodal Dual-Alignment Fusion (MDAF) module that integrates the MLLM's hidden embeddings corresponding to the designed tokens with visual features, significantly enhancing GAD performance. Both quantitative and qualitative experiments demonstrate the superior performance of our proposed method on GAD tasks.
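
To make the vocabulary-expansion and multi-label-loss ideas concrete, here is a minimal sketch, not the authors' code: it assumes a HuggingFace-style causal LM ("gpt2" merely stands in for the MLLM's language decoder), and the group count, activity count, and classifier head are hypothetical choices.

```python
# Sketch: expand the vocabulary with <ACT>/<GROUP> tokens and score the
# <ACT> hidden state with a multi-label (sigmoid/BCE) loss, since a clip
# can contain several fine-grained activities at once.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

NUM_GROUPS = 4       # hypothetical number of cluster-specific <GROUP> tokens
NUM_ACTIVITIES = 8   # hypothetical number of fine-grained activity labels

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the MLLM decoder
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Add one activity-level token and multiple cluster-specific group tokens,
# then grow the embedding matrix so the new ids get trainable rows.
new_tokens = ["<ACT>"] + [f"<GROUP_{i}>" for i in range(NUM_GROUPS)]
tokenizer.add_tokens(new_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# Illustrative classifier head over the <ACT> token's hidden state.
act_classifier = nn.Linear(model.config.hidden_size, NUM_ACTIVITIES)

def act_multi_label_loss(act_hidden, labels):
    # act_hidden: (B, D) hidden state at the <ACT> position
    # labels:     (B, NUM_ACTIVITIES) multi-hot activity annotations
    logits = act_classifier(act_hidden)
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```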
Problem

Research questions and friction points this paper is trying to address.

Detecting group activities and members in videos
Overcoming implicit pattern recognition limitations in GAD
Enhancing contextual reasoning via multimodal language instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces activity-level and group-specific tokens
Uses a Multimodal Dual-Alignment Fusion (MDAF) module (see the sketch after this list)
Leverages MLLM commonsense knowledge for reasoning
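
The page does not spell out MDAF's internals, so the following is a hedged sketch, assuming the module aligns the MLLM's <ACT>/<GROUP> hidden embeddings with visual features via cross-attention in both directions before fusing them; the class name, head count, pooling, and projection are illustrative assumptions, not the authors' design.

```python
# Sketch of a dual-alignment fusion block in the spirit of MDAF.
import torch
import torch.nn as nn

class DualAlignmentFusion(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        # One cross-attention per direction: tokens attend to visual
        # features, and visual features attend to tokens.
        self.tok2vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis2tok = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, token_emb, visual_feat):
        # token_emb:   (B, T, D) MLLM hidden states at <ACT>/<GROUP> positions
        # visual_feat: (B, N, D) per-frame or per-person visual features
        t_aligned, _ = self.tok2vis(token_emb, visual_feat, visual_feat)
        v_aligned, _ = self.vis2tok(visual_feat, token_emb, token_emb)
        t = self.norm_t(token_emb + t_aligned)   # residual + norm per stream
        v = self.norm_v(visual_feat + v_aligned)
        # Pool each aligned stream and fuse into a single feature.
        fused = torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.proj(fused)                  # (B, D) fused representation
```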
👥 Authors
Jihua Peng
The Hong Kong Polytechnic University
Qianxiong Xu
Nanyang Technological University
Yichen Liu
SenseTime Research
Chenxi Liu
Nanyang Technological University
Cheng Long
Nanyang Technological University
databases, machine learning, data mining
Rui Zhao
SenseTime Research
Ziyue Li
CS PhD, University of Maryland
machine learning