MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking

📅 2025-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Moving object segmentation (MOS) from a single image—without temporal cues—is an underexplored yet critical challenge, especially for real-time applications like autonomous driving. Method: We propose the first purely single-frame MOS framework. It leverages multimodal large language models (MLLMs) with chain-of-thought (CoT) reasoning to generate semantic prompts, jointly harnessing SAM and vision-language models (VLMs) for cross-modal feature alignment and logic-guided segmentation. An iterative reasoning refinement loop further enhances scene understanding and segmentation accuracy. Contributions/Results: (1) We formally define and solve the video-free MOS task for the first time; (2) we establish an interpretable, end-to-end single-image MOS paradigm; (3) our method achieves 92.5% J&F on public MOS benchmarks—significantly outperforming prior single-frame approaches—and demonstrates robust performance in real-world autonomous driving scenarios, matching or exceeding multi-frame methods.

📝 Abstract
Moving object segmentation plays a vital role in understanding dynamic visual environments. While existing methods rely on multi-frame image sequences to identify moving objects, single-image MOS is critical for applications like motion intention prediction and handling camera frame drops. However, segmenting moving objects from a single image remains challenging for existing methods due to the absence of temporal cues. To address this gap, we propose MovSAM, the first framework for single-image moving object segmentation. MovSAM leverages a Multimodal Large Language Model (MLLM) enhanced with Chain-of-Thought (CoT) prompting to search for the moving object and generate text prompts based on deep thinking for segmentation. These prompts are cross-fused with visual features from the Segment Anything Model (SAM) and a Vision-Language Model (VLM), enabling logic-driven moving object segmentation. The segmentation results then undergo a deep thinking refinement loop, allowing MovSAM to iteratively improve its understanding of the scene context and inter-object relationships through logical reasoning. This approach enables MovSAM to segment moving objects in single images by reasoning about scene semantics. We deploy MovSAM in the real world to validate its practical effectiveness in autonomous driving scenarios where multi-frame methods fail. Furthermore, despite the inherent advantage of multi-frame methods in utilizing temporal information, MovSAM achieves state-of-the-art performance across public MOS benchmarks, reaching 92.5% on J&F. Our implementation will be available at https://github.com/IRMVLab/MovSAM.
Problem

Research questions and friction points this paper is trying to address.

Segmenting moving objects from a single image is difficult without temporal cues
Multi-frame methods fail under camera frame drops and cannot support motion intention prediction from one image
Existing approaches lack the scene-level logical reasoning needed to infer motion from static context
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM enhanced with CoT prompting
Cross-fusion of text and visual features
Deep thinking refinement loop
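The three innovations above form a pipeline: CoT-prompted reasoning proposes a candidate moving object, cross-fused segmentation produces a mask, and the refinement loop lets the MLLM critique and revise the result. The sketch below illustrates that control flow only; every function name and return value is a hypothetical stand-in, not the authors' actual API.

```python
# Illustrative sketch of MovSAM's deep-thinking refinement loop.
# All names (mllm_cot_prompt, segment_with_prompt, critique) are
# hypothetical stand-ins for the MLLM, SAM/VLM fusion, and the
# MLLM's self-critique step described in the paper.

def mllm_cot_prompt(image, feedback=None):
    """Stand-in for the MLLM: reasons step by step (chain-of-thought)
    about the scene and emits a text prompt naming the likely moving
    object; refined feedback from a prior iteration overrides it."""
    return feedback or "the car crossing the intersection"

def segment_with_prompt(image, text_prompt):
    """Stand-in for cross-fusing the text prompt with SAM and VLM
    visual features; returns a binary mask (toy 2x2 example here)."""
    return {"prompt": text_prompt, "mask": [[0, 1], [1, 1]]}

def critique(image, mask):
    """Stand-in for the MLLM re-examining the mask against scene
    context and inter-object relations; returns a refined prompt,
    or None to accept. This toy version accepts immediately."""
    return None

def movsam_loop(image, max_iters=3):
    """Iterate: reason -> segment -> critique, until the MLLM
    accepts the mask or the iteration budget runs out."""
    feedback, result = None, None
    for _ in range(max_iters):
        prompt = mllm_cot_prompt(image, feedback)
        result = segment_with_prompt(image, prompt)
        feedback = critique(image, result["mask"])
        if feedback is None:  # segmentation accepted
            break
    return result
```

In the real system each stub would be an expensive model call, so the `max_iters` cap bounds latency; the loop structure, not the stubs, is the point of the sketch.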
Chang Nie
Department of Automation, Key Laboratory of System Control and Information Processing of Ministry of Education, Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Engineering Research Center of Intelligent Control and Management, Shanghai Jiao Tong University, Shanghai 200240, China
Yiqing Xu
Department of Political Science, Stanford University
political methodology, applied statistics, comparative politics, positive political economy
Guangming Wang
University of Cambridge, ETH Zurich, and Shanghai Jiao Tong University
Robot Vision, Robot Manipulation, Robotics, Computer Vision, Autonomous Driving
Zhe Liu
Department of Automation, Key Laboratory of System Control and Information Processing of Ministry of Education, Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Engineering Research Center of Intelligent Control and Management, Shanghai Jiao Tong University, Shanghai 200240, China
Yanzi Miao
The Advanced Robotics Research Center, Artificial Intelligence Research Institute and School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
Hesheng Wang
Department of Automation, Key Laboratory of System Control and Information Processing of Ministry of Education, Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Engineering Research Center of Intelligent Control and Management, Shanghai Jiao Tong University, Shanghai 200240, China