Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model

📅 2025-09-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing group activity detection methods rely on implicit visual modeling, lacking contextual reasoning and interpretability. To address this, we propose a language-instruction-guided multimodal large language model framework that enables fine-grained group activity detection through vision–language collaborative reasoning. Our key contributions are: (1) introducing activity-level <ACT> and group-level <GROUP> semantic tokens to integrate pretrained commonsense knowledge; (2) designing a multi-label classification loss to support fine-grained behavioral modeling; and (3) proposing a Multimodal Dual-Alignment Fusion (MDAF) module for cross-modal feature alignment and complementary enhancement. Extensive experiments on multiple benchmark datasets demonstrate significant improvements over state-of-the-art methods. Both quantitative evaluations and qualitative analyses confirm the framework’s superior semantic understanding, interpretable reasoning capability, and robustness in complex, real-world scenarios.

📝 Abstract
Group activity detection (GAD) aims to simultaneously identify group members and categorize their collective activities within video sequences. Existing deep learning-based methods develop specialized architectures (e.g., transformer networks) to model the dynamics of individual roles and semantic dependencies between individuals and groups. However, they rely solely on implicit pattern recognition from visual features and struggle with contextual reasoning and explainability. In this work, we propose LIR-GAD, a novel framework of language-instructed reasoning for GAD via a Multimodal Large Language Model (MLLM). Our approach expands the original vocabulary of the MLLM by introducing an activity-level <ACT> token and multiple cluster-specific <GROUP> tokens. We process video frames alongside these two types of specially designed tokens and language instructions, which are then integrated into the MLLM. The pretrained commonsense knowledge embedded in the MLLM enables the <ACT> token and <GROUP> tokens to effectively capture the semantic information of collective activities and learn distinct representational features of different groups, respectively. We also introduce a multi-label classification loss to further enhance the <ACT> token's ability to learn discriminative semantic representations. Then, we design a Multimodal Dual-Alignment Fusion (MDAF) module that integrates the MLLM's hidden embeddings corresponding to the designed tokens with visual features, significantly enhancing GAD performance. Both quantitative and qualitative experiments demonstrate the superior performance of our proposed method on GAD tasks.
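
To make the vocabulary-expansion and multi-label-loss ideas concrete, here is a minimal sketch, not the authors' code: it assumes a HuggingFace-style causal LM ("gpt2" merely stands in for the MLLM's language decoder), and the group count, activity count, and classifier head are hypothetical choices.

```python
# Sketch: expand the vocabulary with <ACT>/<GROUP> tokens and score the
# <ACT> hidden state with a multi-label (sigmoid/BCE) loss, since a clip
# can contain several fine-grained activities at once.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

NUM_GROUPS = 4       # hypothetical number of cluster-specific <GROUP> tokens
NUM_ACTIVITIES = 8   # hypothetical number of fine-grained activity labels

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the MLLM decoder
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Add one activity-level token and multiple cluster-specific group tokens,
# then grow the embedding matrix so the new ids get trainable rows.
new_tokens = ["<ACT>"] + [f"<GROUP_{i}>" for i in range(NUM_GROUPS)]
tokenizer.add_tokens(new_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# Illustrative classifier head over the <ACT> token's hidden state.
act_classifier = nn.Linear(model.config.hidden_size, NUM_ACTIVITIES)

def act_multi_label_loss(act_hidden, labels):
    # act_hidden: (B, D) hidden state at the <ACT> position
    # labels:     (B, NUM_ACTIVITIES) multi-hot activity annotations
    logits = act_classifier(act_hidden)
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```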
Problem

Research questions and friction points this paper is trying to address.

Detecting group activities and members in videos
Overcoming implicit pattern recognition limitations in GAD
Enhancing contextual reasoning via multimodal language instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces activity-level and group-specific tokens
Uses a Multimodal Dual-Alignment Fusion (MDAF) module (see the sketch after this list)
Leverages MLLM commonsense knowledge for reasoning
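
The page does not spell out MDAF's internals, so the following is a hedged sketch, assuming the module aligns the MLLM's <ACT>/<GROUP> hidden embeddings with visual features via cross-attention in both directions before fusing them; the class name, head count, pooling, and projection are illustrative assumptions, not the authors' design.

```python
# Sketch of a dual-alignment fusion block in the spirit of MDAF.
import torch
import torch.nn as nn

class DualAlignmentFusion(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        # One cross-attention per direction: tokens attend to visual
        # features, and visual features attend to tokens.
        self.tok2vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis2tok = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, token_emb, visual_feat):
        # token_emb:   (B, T, D) MLLM hidden states at <ACT>/<GROUP> positions
        # visual_feat: (B, N, D) per-frame or per-person visual features
        t_aligned, _ = self.tok2vis(token_emb, visual_feat, visual_feat)
        v_aligned, _ = self.vis2tok(visual_feat, token_emb, token_emb)
        t = self.norm_t(token_emb + t_aligned)   # residual + norm per stream
        v = self.norm_v(visual_feat + v_aligned)
        # Pool each aligned stream and fuse into a single feature.
        fused = torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.proj(fused)                  # (B, D) fused representation
```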
👥 Authors
Jihua Peng
The Hong Kong Polytechnic University
Qianxiong Xu
Nanyang Technological University
Yichen Liu
SenseTime Research
Chenxi Liu
Nanyang Technological University
Cheng Long
Nanyang Technological University
databases, machine learning, data mining
Rui Zhao
SenseTime Research
Ziyue Li
CS PhD, University of Maryland
machine learning