ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video object segmentation (VOS) under natural language queries requires joint spatiotemporal reasoning over video sequences and complex textual instructions; however, existing fine-tuned multimodal large language model (MLLM)-based approaches struggle to effectively integrate temporal dynamics with spatial semantics. To address this, we propose ThinkVideo—a training-free framework that introduces zero-shot chain-of-thought (CoT) reasoning to VOS for the first time. It leverages MLLM-guided keyframe selection and dynamic target refinement to robustly handle both explicit and implicit, as well as streaming, temporally sensitive queries. Its modular design decouples language understanding, segmentation, and video processing—enabling seamless integration with proprietary MLLMs, image-level segmenters, and SAM2-based video processors. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple benchmarks, with both quantitative metrics and qualitative analyses confirming superior performance.

📝 Abstract
Reasoning Video Object Segmentation is a challenging task that generates a mask sequence from an input video and an implicit, complex text query. Existing works approach the problem by fine-tuning Multimodal Large Language Models (MLLMs) for segmentation-based output, yet still fall short in difficult cases on videos with temporally sensitive queries, primarily due to a failure to integrate temporal and spatial information. In this paper, we propose ThinkVideo, a novel framework that leverages the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address these challenges. Specifically, ThinkVideo uses CoT prompts to extract object selectivities associated with particular keyframes, then bridges a reasoning image segmentation model and the SAM2 video processor to output mask sequences. The ThinkVideo framework is training-free and compatible with closed-source MLLMs, and can also be applied to Reasoning Video Instance Segmentation. We further extend the framework to online video streams, where CoT is used to update the object of interest when a better target emerges and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that ThinkVideo significantly outperforms previous works in both cases, both qualitatively and quantitatively.
Problem

Research questions and friction points this paper is trying to address.

Improving video object segmentation with complex text queries
Integrating temporal and spatial information for better accuracy
Enhancing zero-shot reasoning in video segmentation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages zero-shot Chain-of-Thought MLLM capability
Bridges reasoning segmentation and SAM2 video processor
Training-free framework for online video streams
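The decoupled pipeline described above (CoT keyframe selection, image-level reasoning segmentation, SAM2-style mask propagation) can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: every function here is a hypothetical stand-in, and the real system would call an MLLM, a reasoning image segmenter, and the SAM2 video processor instead of these toy placeholders.

```python
def cot_select_keyframes(frames, query):
    """Stand-in for MLLM chain-of-thought keyframe selection.
    A real implementation would prompt an MLLM with the query and
    sampled frames; here we simply pick every other frame."""
    return [i for i in range(len(frames)) if i % 2 == 0]

def segment_keyframe(frame, query):
    """Stand-in for a reasoning image segmentation model.
    Returns a dummy binary mask the same size as the frame."""
    return [[1] * len(row) for row in frame]

def propagate_masks(frames, keyframe_masks):
    """Stand-in for SAM2-style mask propagation: copies the nearest
    keyframe mask to every frame instead of tracking the object."""
    keys = sorted(keyframe_masks)
    out = []
    for i in range(len(frames)):
        nearest = min(keys, key=lambda k: abs(k - i))
        out.append(keyframe_masks[nearest])
    return out

def thinkvideo(frames, query):
    """Orchestration only: select keyframes, segment them at the
    image level, then propagate the masks across the whole video."""
    key_ids = cot_select_keyframes(frames, query)
    keyframe_masks = {i: segment_keyframe(frames[i], query)
                      for i in key_ids}
    return propagate_masks(frames, keyframe_masks)

frames = [[[0, 0], [0, 0]] for _ in range(5)]  # five tiny 2x2 "frames"
masks = thinkvideo(frames, "the object that moves fastest")
print(len(masks))  # one mask per input frame
```

Because the three stages only exchange frame indices and masks, each stand-in can be swapped for a proprietary MLLM, a different image segmenter, or a different video propagator without touching the rest of the pipeline, which is the training-free, modular property the bullets above describe.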