🤖 AI Summary
Existing video-based multimodal models handle basic video conversations but struggle with fine-grained, pixel-level alignment between video and text, particularly under complex scenes and temporal dynamics, which limits localization accuracy. To address this, we propose VideoGLaMM, a spatio-temporal architecture for pixel-level video grounding. Our method pairs a dual vision encoder (capturing spatial and temporal details) with a spatio-temporal decoder for mask generation, connected to a Large Language Model through tunable vision-language (V-L) and language-vision (L-V) adapters that enable close cross-modal alignment. We also curate a large-scale multimodal grounding dataset via a semi-automatic annotation pipeline, comprising 38K video-QA triplets covering 83K objects with 671K high-quality, pixel-accurate masks. The unified framework supports three core tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Extensive experiments demonstrate consistent improvements over existing methods across all three tasks, enhancing both spatio-temporal consistency and pixel-level localization accuracy.
📝 Abstract
Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V-L and L-V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semi-automatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
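The data flow described above (dual vision encoding, V-L adaptation into the LLM, L-V adaptation of the LLM output, and mask decoding) can be illustrated with a minimal, hypothetical sketch. All function names here are illustrative assumptions, not the authors' API, and the toy feature extractors stand in for the deep networks a real implementation would use:

```python
# Hypothetical sketch of a VideoGLaMM-style pipeline. Function names and
# the toy "features" are illustrative assumptions, not the paper's code.

def spatial_encode(frame):
    # Per-frame "spatial" features (stand-in: mean pixel value).
    return [sum(frame) / len(frame)]

def temporal_encode(frames):
    # Cross-frame "temporal" features (stand-in: frame-to-frame differences).
    means = [sum(f) / len(f) for f in frames]
    return [b - a for a, b in zip(means, means[1:])]

def vl_adapter(visual_feats, dim=4):
    # V-L adapter: project visual features into the LLM token space
    # (stand-in: pad/truncate to a fixed width).
    return (visual_feats + [0.0] * dim)[:dim]

def lv_adapter(llm_feats, dim=2):
    # L-V adapter: map LLM outputs back to decoder mask queries.
    return (llm_feats + [0.0] * dim)[:dim]

def spatio_temporal_decoder(queries, frames):
    # Decoder stub: one binary "mask" per frame, thresholded by a query.
    thresh = queries[0]
    return [[1 if px > thresh else 0 for px in f] for f in frames]

def ground(frames, text_tokens):
    # 1) dual encoding, 2) V-L adaptation into the LLM input sequence,
    # 3) L-V adaptation of the (stubbed) LLM output, 4) mask decoding.
    feats = spatial_encode(frames[0]) + temporal_encode(frames)
    llm_input = vl_adapter(feats) + text_tokens  # fused multimodal sequence
    llm_output = llm_input                       # stand-in for the LLM
    queries = lv_adapter(llm_output)
    return spatio_temporal_decoder(queries, frames)

masks = ground([[0.1, 0.9], [0.2, 0.8]], text_tokens=[0.5])
print(masks)  # one binary mask per frame: [[0, 1], [0, 1]]
```

The point of the sketch is the wiring, not the operations: both visual paths feed the LLM through one adapter, and a second adapter routes the language output back to a decoder that emits per-frame masks.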