Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos

📅 2025-08-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Video-language alignment faces challenges including linguistic complexity, evolving and interacting entities, action-chain modeling, and semantic gaps between modalities. To address these, the paper proposes the Planner-Refiner framework, which introduces a language-guided iterative space-time refinement mechanism: the Planner decomposes long linguistic prompts into chains of short sentences, while the Refiner, conditioned on each sentence's noun-phrase and verb-phrase pair, directs visual tokens' self-attention across space and then time, with a recurrent structure carrying the refined tokens from one step to the next. This design targets comprehension of complex, compositional language. To evaluate long-query understanding, the authors introduce MeViS-X, a new benchmark designed for this purpose. The method reports superior performance versus state-of-the-art methods on Referring Video Object Segmentation and Temporal Grounding, including on MeViS-X.

📝 Abstract

Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining visual elements' space-time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens' self-attention across space then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We demonstrate Planner-Refiner's effectiveness on two video-language alignment tasks: Referring Video Object Segmentation and Temporal Grounding with varying language complexity. We further introduce a new MeViS-X benchmark to assess models' capability with long queries. Superior performance versus state-of-the-art methods on these benchmarks shows the approach's potential, especially for complex prompts.
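The refinement loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `embed` function, the additive way the language vector guides attention, the residual update, and the hand-written `plan` (standing in for the Planner's output) are all assumptions made for illustration. Only the overall order comes from the abstract: for each noun-phrase and verb-phrase pair, attend across space, then across time, and carry the refined tokens to the next step.

```python
import math

def embed(phrase, dim):
    """Toy, deterministic phrase embedding (hypothetical stand-in for a text encoder)."""
    vec = [0.0] * dim
    for i, ch in enumerate(phrase):
        vec[i % dim] += ord(ch) / 1000.0
    return vec

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(tokens, guide):
    """Self-attention over `tokens`; queries are shifted by the language `guide` vector."""
    dim = len(guide)
    out = []
    for q in tokens:
        query = [qi + gi for qi, gi in zip(q, guide)]
        scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        mixed = [sum(w * k[i] for w, k in zip(weights, tokens)) for i in range(dim)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])  # residual update
    return out

def refine_step(video, noun_vec, verb_vec):
    """One refinement step on tokens laid out as video[time][position][dim]."""
    # Spatial pass: tokens within each frame attend to one another, guided by the noun phrase.
    video = [attend(frame, noun_vec) for frame in video]
    # Temporal pass: each token position attends along time, guided by the verb phrase.
    num_positions = len(video[0])
    tracks = [attend([frame[n] for frame in video], verb_vec) for n in range(num_positions)]
    # Transpose the per-position tracks back to [time][position][dim].
    return [[tracks[n][t] for n in range(num_positions)] for t in range(len(video))]

def refine(video, plan, dim):
    """Recurrently chain refinement steps, one per (noun phrase, verb phrase) pair."""
    for noun_phrase, verb_phrase in plan:
        video = refine_step(video, embed(noun_phrase, dim), embed(verb_phrase, dim))
    return video

# Two frames, three tokens per frame, four channels per token.
video = [[[0.1 * (t + n + i) for i in range(4)] for n in range(3)] for t in range(2)]
# Hand-written plan; in the paper this chain would come from the Planner module.
plan = [("the brown dog", "runs to the left"), ("the brown dog", "jumps over the fence")]
refined = refine(video, plan, dim=4)
```

The refined tokens keep the input layout, so in a full system they could be passed to a task-specific head for segmentation or temporal grounding.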

Problem

Research questions and friction points this paper is trying to address.

Aligning vision and language in videos with complex language, evolving interacting entities, and action chains

Semantic gaps between language and the space-time representation of visual elements

Long, complex linguistic prompts that call for decomposition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Iteratively refines space-time visual representations

Decomposes complex language into short sentence chains

Chains efficient single-step refinements with a recurrent system that maintains the refined visual token representations

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs