Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos

📅 2025-08-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Video-language alignment faces challenges including linguistic complexity, dynamic entity evolution, behavioral chain modeling, and semantic gaps between modalities. To address these, we propose the Planner-Refiner framework, the first to introduce a language-guided iterative spatiotemporal refinement mechanism: the Planner decomposes long linguistic instructions into sequential short sentences, while the Refiner dynamically refines visual token representations via spatial-to-temporal self-attention and recurrent structures conditioned on noun-verb phrase pairs. This design significantly enhances comprehension of complex, compositional language. To rigorously evaluate long-query understanding, we introduce MeViS-X, a novel benchmark specifically designed for this purpose. Our method achieves state-of-the-art performance across both referring expression video segmentation and temporal grounding tasks, outperforming all existing approaches on MeViS-X and multiple mainstream benchmarks.

๐Ÿ“ Abstract
Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining visual elements' space-time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens' self-attention across space then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We demonstrate Planner-Refiner's effectiveness on two video-language alignment tasks: Referring Video Object Segmentation and Temporal Grounding with varying language complexity. We further introduce a new MeViS-X benchmark to assess models' capability with long queries. Superior performance versus state-of-the-art methods on these benchmarks shows the approach's potential, especially for complex prompts.
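The recurrent refinement loop described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes plain dot-product self-attention without learned projections, additive language conditioning, and precomputed phrase embeddings, all of which are simplifications for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (..., seq, dim); plain scaled dot-product self-attention,
    # without learned Q/K/V projections (a simplification).
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def refiner_step(tokens, phrase):
    """One single-step refinement (hypothetical simplification):
    inject the short-sentence embedding as guidance, attend over
    space within each frame, then over time per spatial location."""
    x = tokens + phrase               # broadcast language guidance, (T, S, D)
    x = self_attention(x)             # spatial: sequence axis = spatial tokens
    x = x.swapaxes(0, 1)              # (S, T, D): sequence axis = time
    x = self_attention(x)             # temporal self-attention
    return x.swapaxes(0, 1) + tokens  # residual keeps earlier refinement

def planner_refiner(tokens, phrase_chain):
    """Recurrently apply the Refiner once per short sentence in the
    Planner's chain, carrying refined visual tokens forward."""
    for phrase in phrase_chain:
        tokens = refiner_step(tokens, phrase)
    return tokens
```

The recurrence is the key design point: each short sentence (a noun-phrase/verb-phrase pair) refines the tokens produced by the previous step, so language guidance accumulates across the chain rather than being applied to the raw video features each time.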
Problem

Research questions and friction points this paper is trying to address.

Addresses vision-language alignment complexity in videos
Bridges semantic gaps via iterative space-time refinement
Handles complex linguistic prompts with decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iteratively refines space-time visual representations
Decomposes complex language into short sentence chains
Chains efficient single-step refinements through a recurrent system
Tuyen Tran
Deakin University
Thao Minh Le
Applied Artificial Intelligence Institute, Deakin University, Australia
Quang-Hung Le
Applied Artificial Intelligence Institute, Deakin University, Australia
Truyen Tran
Professor | Head of AI, Health and Science @ Deakin University
artificial intelligence · AI for health · AI for science