🤖 AI Summary
Referring Video Object Segmentation (RVOS) faces the core challenge of fine-grained cross-modal alignment between dynamic video content and static linguistic descriptions, particularly when target objects share similar appearances but exhibit distinct motion patterns or poses. To address this, we propose PARSE-VOS, the first framework that leverages large language models (LLMs) for semantic parsing together with a two-stage conditional reasoning mechanism, hierarchically aligning text and video via motion-based coarse filtering followed by pose-aware fine verification, with no end-to-end training required. Our method integrates natural language parsing, spatio-temporal localization, and hierarchical identification modules to explicitly model the compositional semantic structure of referring expressions. Evaluated on three major benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, and MeViS), PARSE-VOS achieves state-of-the-art performance, with substantial gains in segmentation accuracy on complex, ambiguous scenarios involving motion-pose ambiguity and occlusion.
📝 Abstract
Referring Video Object Segmentation (RVOS) aims to segment an object of interest throughout a video based on a language description. The central challenge lies in aligning static text with dynamic visual content, particularly when objects exhibit similar appearances but inconsistent motion and poses. Current methods often rely on holistic visual-language fusion that struggles with complex, compositional descriptions. In this paper, we propose **PARSE-VOS**, a novel, training-free framework powered by Large Language Models (LLMs) that performs hierarchical, coarse-to-fine reasoning across the text and video domains. Our approach begins by parsing the natural language query into structured semantic commands. Next, a spatio-temporal grounding module generates candidate trajectories for all potential target objects, guided by the parsed semantics. Finally, a hierarchical identification module selects the correct target through a two-stage reasoning process: it first performs coarse-grained motion reasoning with an LLM to narrow down the candidates; if ambiguity remains, a fine-grained pose verification stage is conditionally triggered to disambiguate. The final output is an accurate segmentation mask for the target object. **PARSE-VOS** achieves state-of-the-art performance on three major benchmarks: Ref-YouTube-VOS, Ref-DAVIS17, and MeViS.
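The two-stage conditional reasoning described above can be sketched in a few lines. The following is a minimal, illustrative toy, not the paper's implementation: every name is hypothetical, the LLM parser is replaced by a trivial string splitter, and candidates carry precomputed motion and pose labels standing in for trajectory-level reasoning.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One candidate object trajectory (illustrative stand-in)."""
    track_id: int
    motion: str  # coarse motion label inferred from the trajectory
    pose: str    # fine-grained pose label, consulted only if needed

def parse_query(query: str) -> dict:
    # Toy stand-in for LLM-based semantic parsing: a real system would
    # prompt an LLM to emit structured commands from free-form language.
    parts = dict(kv.split("=") for kv in query.split(";"))
    return {"motion": parts.get("motion"), "pose": parts.get("pose")}

def identify_target(candidates, commands):
    # Stage 1: coarse-grained motion reasoning narrows the candidate set.
    survivors = [c for c in candidates if c.motion == commands["motion"]]
    if len(survivors) == 1:
        return survivors[0]
    # Stage 2: pose verification is triggered only if ambiguity remains.
    survivors = [c for c in survivors if c.pose == commands["pose"]]
    return survivors[0] if survivors else None

candidates = [
    Candidate(0, motion="running", pose="head up"),
    Candidate(1, motion="running", pose="head down"),
    Candidate(2, motion="sitting", pose="head up"),
]
commands = parse_query("motion=running;pose=head down")
target = identify_target(candidates, commands)  # pose stage disambiguates → track 1
```

The point of the conditional trigger is cost: when motion alone singles out one trajectory (e.g. the query `motion=sitting;pose=head up` above), the fine-grained pose stage is skipped entirely.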