AI Summary
Video-language alignment faces challenges including linguistic complexity, dynamic entity evolution, behavioral chain modeling, and semantic gaps between modalities. To address these, we propose the Planner-Refiner framework, which introduces a language-guided iterative spatiotemporal refinement mechanism: the Planner decomposes long linguistic instructions into a sequence of short sentences, while the Refiner refines visual token representations step by step via spatial-then-temporal self-attention and a recurrent structure, conditioned on noun-phrase and verb-phrase pairs. This design targets comprehension of complex, compositional language. To evaluate long-query understanding, we introduce MeViS-X, a new benchmark designed for this purpose. Our method outperforms state-of-the-art methods on both referring video object segmentation and temporal grounding, including on MeViS-X.
Abstract
Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining the space-time representation of visual elements, guided by language, until the semantic gap is minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into chains of short sentences. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens' self-attention across space and then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining the refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We demonstrate Planner-Refiner's effectiveness on two video-language alignment tasks, Referring Video Object Segmentation and Temporal Grounding, under varying language complexity. We further introduce a new MeViS-X benchmark to assess models' capability with long queries. Superior performance versus state-of-the-art methods on these benchmarks shows the approach's potential, especially for complex prompts.
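The recurrent refinement loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`attend`, `refine_step`, `planner_refiner`), the additive conditioning of queries on a phrase embedding, and the choice to let the noun phrase guide the spatial pass and the verb phrase guide the temporal pass are all assumptions made for illustration. The Planner's output is assumed to be already available as a list of (noun-phrase vector, verb-phrase vector) pairs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(tokens, guide):
    """Self-attention over `tokens` with queries shifted by a language vector.

    Assumption: conditioning is a simple additive shift of each query by
    `guide`; the paper does not specify this mechanism in the abstract.
    """
    d = len(guide)
    out = []
    for q in tokens:
        q_cond = [qi + gi for qi, gi in zip(q, guide)]
        weights = softmax([dot(q_cond, k) / math.sqrt(d) for k in tokens])
        out.append([sum(w * k[i] for w, k in zip(weights, tokens))
                    for i in range(d)])
    return out


def refine_step(video, noun_vec, verb_vec):
    """One refinement step: attention across space, then across time.

    video: list of T frames, each a list of N token vectors of size d.
    """
    # Spatial pass: tokens attend within their own frame (noun-guided here).
    spatial = [attend(frame, noun_vec) for frame in video]
    num_frames, num_tokens = len(spatial), len(spatial[0])
    # Temporal pass: each spatial location attends across frames (verb-guided).
    tracks = [attend([spatial[t][n] for t in range(num_frames)], verb_vec)
              for n in range(num_tokens)]
    # Transpose back to frame-major layout.
    return [[tracks[n][t] for n in range(num_tokens)]
            for t in range(num_frames)]


def planner_refiner(video, plan):
    """Chain refinement steps, carrying the refined tokens between steps."""
    state = video
    for noun_vec, verb_vec in plan:
        state = refine_step(state, noun_vec, verb_vec)
    return state


# Toy input: 3 frames, 2 tokens per frame, 2-dimensional features.
video = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.5, 0.5], [0.2, 0.8]],
         [[0.0, 1.0], [1.0, 0.0]]]
# Hypothetical plan: two short sentences, each a (noun, verb) embedding pair.
plan = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 1.0], [1.0, 0.0])]
refined = planner_refiner(video, plan)
```

In this sketch the refined tokens keep the input layout (frames by tokens by features), so a task-specific head for segmentation or temporal grounding could consume `refined` in place of the raw visual tokens. A real system would use learned projections and multi-head attention rather than the projection-free attention shown here.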