🤖 AI Summary
Current video large language models (ViLLMs) lack identity awareness: they cannot reliably identify specific individuals or understand their activities and dialogues within a single video, which limits their applicability in personalized domains such as smart healthcare and intelligent homes. To address this, we propose PVChat, the first identity-aware one-shot video question-answering framework. PVChat introduces ReLU Routing Mixture-of-Heads (MoH) attention, Smooth Proximity Regularization, and Head Activation Enhancement to strengthen subject-centric feature modeling. It further incorporates a progressive image-to-video learning strategy, automated identity-preserving data augmentation, and a two-stage training scheme. We also construct a benchmark comprising four identity-consistent synthetic video-question-answer datasets. Extensive experiments demonstrate that PVChat significantly outperforms state-of-the-art ViLLMs across diverse domains, including medical videos, TV series, anime, and real-world footage, achieving accurate and robust identity-aware single-video QA.
📝 Abstract
Video large language models (ViLLMs) excel at general video understanding, e.g., recognizing activities such as talking and eating, but struggle with identity-aware comprehension, such as "Wilson is receiving chemotherapy" or "Tom is discussing with Sarah", which limits their applicability in smart healthcare and smart home environments. To address this limitation, we propose PVChat, a one-shot learning framework and the first personalized ViLLM that enables subject-aware question answering (QA) from a single video per subject. Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset, leveraging a progressive image-to-video learning strategy. Specifically, we introduce an automated augmentation pipeline that synthesizes identity-preserving positive samples and retrieves hard negatives from existing video corpora, generating a diverse training dataset with four QA types: existence, appearance, action, and location inquiries. To enhance subject-specific learning, we propose a ReLU Routing MoH attention mechanism, alongside two novel objectives: (1) Smooth Proximity Regularization for progressive learning through exponential distance scaling and (2) Head Activation Enhancement for balanced attention routing. Finally, we adopt a two-stage training strategy that transitions from image pre-training to video fine-tuning, enabling a gradual learning process from static attributes to dynamic representations. We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, and real-world footage, demonstrating its superiority in personalized feature understanding after learning from a single video, compared with state-of-the-art ViLLMs.
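The ReLU Routing MoH attention named in the abstract can be illustrated with a minimal sketch: a per-token router scores each attention head, a ReLU gate zeroes out low-scoring heads (giving sparse, non-negative routing), and the surviving heads' outputs are weighted by their gates. This is an illustrative reconstruction under stated assumptions only; the function name `relu_routed_moh_attention`, the router placement, and all shapes are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of ReLU-routed Mixture-of-Heads (MoH) attention.
# Assumption: routing is per token, via a learned projection Wr followed by ReLU.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relu_routed_moh_attention(x, Wq, Wk, Wv, Wr, num_heads):
    """x: (seq, d_model); Wq/Wk/Wv: (d_model, d_model); Wr: (d_model, num_heads)."""
    seq, d_model = x.shape
    d_head = d_model // num_heads
    # Per-token head gates: ReLU makes the routing sparse and non-negative.
    gates = np.maximum(x @ Wr, 0.0)                      # (seq, num_heads)
    q = (x @ Wq).reshape(seq, num_heads, d_head)
    k = (x @ Wk).reshape(seq, num_heads, d_head)
    v = (x @ Wv).reshape(seq, num_heads, d_head)
    out = np.zeros_like(q)
    for h in range(num_heads):
        attn = softmax(q[:, h] @ k[:, h].T / np.sqrt(d_head))
        # A head with a zero gate contributes nothing, i.e. it is skipped.
        out[:, h] = gates[:, h:h + 1] * (attn @ v[:, h])
    return out.reshape(seq, d_model), gates
```

In this reading, the abstract's Head Activation Enhancement would act as an auxiliary objective keeping the `gates` matrix from collapsing onto a few heads, though its exact form is not specified in the abstract.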