🤖 AI Summary
This work addresses the unified understanding of complex dynamic scenes by proposing a lightweight, multi-task vision-language modeling framework. Rather than building task-specific architectures for situation recognition, human-human interaction detection, and human-object interaction detection, it formulates all three tasks uniformly, either as structured text generation or as cross-modal representation concatenation. The core contribution is the first systematic validation and exploitation of the dynamic-scene semantics already encoded in frozen large-scale vision-language models (e.g., CLIP, ALIGN), augmented only with minimal trainable components (lightweight adapter heads) that enable end-to-end multi-task optimization. The approach achieves state-of-the-art performance across multiple dynamic scene understanding benchmarks while drastically reducing the trainable parameter count. Crucially, it reveals that mainstream vision-language models already possess strong implicit representations of dynamic scenes, obviating extensive architectural redesign or full-model fine-tuning.
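The setup described above, a frozen backbone whose features feed small per-task trainable heads, can be sketched minimally as follows. This is an illustrative sketch only: `frozen_encoder` stands in for a pretrained V&L model's pooled image embedding (the paper uses frozen models such as CLIP or ALIGN), and `AdapterHead` and the task names are hypothetical placeholders, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen V&L backbone (e.g. CLIP's pooled image embedding).
# These weights are never updated during training.
FROZEN_W = rng.standard_normal((512, 768))

def frozen_encoder(pixels):
    """Pretend pooled embedding from a frozen pretrained backbone."""
    return pixels @ FROZEN_W

class AdapterHead:
    """Lightweight trainable head: the only parameters that get optimized."""
    def __init__(self, dim_in=768, dim_out=256):
        self.W = rng.standard_normal((dim_in, dim_out)) * 0.01

    def __call__(self, feats):
        # Simple ReLU projection of the frozen features.
        return np.maximum(feats @ self.W, 0.0)

# One small head per task; all share the same frozen features (multi-task setup).
heads = {"situation": AdapterHead(), "hhi": AdapterHead(), "hoi": AdapterHead()}

image = rng.standard_normal((1, 512))  # dummy input in place of real pixels
feats = frozen_encoder(image)          # frozen forward pass, no gradient updates
outputs = {task: head(feats) for task, head in heads.items()}
```

Only the adapter weights would receive gradients in training, which is why the trainable parameter count stays small relative to the full backbone.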
📝 Abstract
Images depicting complex, dynamic scenes are challenging to parse automatically, requiring both high-level comprehension of the overall situation and fine-grained identification of the participating entities and their interactions. Current approaches use distinct methods tailored to sub-tasks such as Situation Recognition and the detection of Human-Human and Human-Object Interactions. However, recent advances in image understanding have often leveraged web-scale vision-language (V&L) representations to obviate task-specific engineering. In this work, we propose a framework for dynamic scene understanding tasks that leverages knowledge from modern, frozen V&L representations. By framing these tasks in a generic manner, either as predicting and parsing structured text or as directly concatenating representations into the input of existing models, we achieve state-of-the-art results while using a minimal number of trainable parameters relative to existing approaches. Moreover, our analysis of the dynamic knowledge in these representations shows that recent, more powerful representations effectively encode dynamic scene semantics, making this approach newly possible.