AI Summary
Complex video object segmentation (VOS) faces core challenges including poor long-term consistency and weak generalization under dense small objects, frequent object disappearance and reappearance, severe occlusion, and adverse conditions (e.g., rain, fog, low illumination). To address these, the 7th LSVOS Challenge introduces MOSEv2, a VOS benchmark track targeting real-world complex scenarios, ranked by an adapted $\mathcal{J}\&\dot{\mathcal{F}}$ metric that better evaluates objects across scales and disappearance cases. Top-performing solutions couple multimodal large language model (MLLM)-driven semantic understanding with memory-aware propagation, improving language-guided accuracy and temporal consistency, and achieve state-of-the-art results on MOSEv2. The report also distills technical trends toward robust, open-world VOS, positioning MOSEv2 as a new standard benchmark and outlining concrete directions for future research.
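The memory-aware propagation mentioned above broadly follows the pattern popularized by memory-based VOS models (e.g., XMem, SAM 2): past frames are encoded into a key/value memory bank, and each new frame segments its target by attending over that bank. Below is a minimal, self-contained sketch of this readout step; the class, names, and shapes are illustrative assumptions, not any team's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MemoryBank:
    """Stores (key, value) features of past frames; values carry mask information."""
    def __init__(self):
        self.keys, self.values = [], []  # each entry: (N_pixels, C_k) / (N_pixels, C_v)

    def add(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def read(self, query):
        """Attend current-frame queries over all memorized pixels.

        query: (Q, C_k) features of the current frame.
        Returns (Q, C_v) mask-bearing readout features for the decoder.
        """
        K = np.concatenate(self.keys, axis=0)    # (M, C_k)
        V = np.concatenate(self.values, axis=0)  # (M, C_v)
        attn = softmax(query @ K.T / np.sqrt(K.shape[1]), axis=-1)  # (Q, M)
        return attn @ V

# Per frame in a full pipeline: encode the image into a query, read the bank,
# decode the readout into a mask, then encode and add this frame's (key, value).
```

In practice the bank is subsampled or evicted over time to bound memory growth on long videos, which is exactly where long-term consistency on MOSEv2-style sequences becomes difficult.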
Abstract
This report presents an overview of the 7th Large-scale Video Object Segmentation (LSVOS) Challenge, held in conjunction with ICCV 2025. Besides the two traditional LSVOS tracks that jointly target robustness in realistic video scenarios, Classic VOS (VOS) and Referring VOS (RVOS), the 2025 edition features a newly introduced track, Complex VOS (MOSEv2). Building upon prior insights, MOSEv2 substantially increases difficulty by introducing more challenging yet realistic scenarios, including denser small objects, frequent disappearance/reappearance events, severe occlusions, and adverse weather and lighting, pushing long-term consistency and generalization beyond curated benchmarks. The challenge retains the standard $\mathcal{J}$, $\mathcal{F}$, and $\mathcal{J}\&\mathcal{F}$ metrics for VOS and RVOS, while MOSEv2 adopts $\mathcal{J}\&\dot{\mathcal{F}}$ as the primary ranking metric to better evaluate objects across scales and disappearance cases. We summarize datasets and protocols, highlight top-performing solutions, and distill emerging trends, such as the growing role of LLM/MLLM components and memory-aware propagation, aiming to chart future directions for resilient, language-aware video segmentation in the wild.
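For concreteness, the standard metrics can be sketched as follows: $\mathcal{J}$ is the region Jaccard (IoU) between predicted and ground-truth masks, $\mathcal{F}$ is a boundary precision/recall F-measure, and $\mathcal{J}\&\mathcal{F}$ is their mean over a video. The snippet below is a simplified illustration assuming binary NumPy masks per frame; the function names are ours, and the official MOSEv2 $\dot{\mathcal{F}}$ variant, which per the abstract better handles objects across scales and disappearance cases, is not reproduced here.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def region_similarity(pred, gt):
    """J: intersection-over-union (Jaccard) between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union  # empty-vs-empty counts as correct

def _contour(mask):
    """1-pixel-wide boundary of a binary mask."""
    return np.logical_xor(mask, binary_erosion(mask))

def boundary_accuracy(pred, gt, tol=2):
    """F: F-measure of boundary precision/recall within a small pixel tolerance."""
    pb, gb = _contour(pred), _contour(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    if pb.sum() == 0 or gb.sum() == 0:
        return 0.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    precision = np.logical_and(pb, binary_dilation(gb, struct)).sum() / pb.sum()
    recall = np.logical_and(gb, binary_dilation(pb, struct)).sum() / gb.sum()
    denom = precision + recall
    return 0.0 if denom == 0 else 2 * precision * recall / denom

def j_and_f(preds, gts):
    """J&F: mean of per-frame J and per-frame F, averaged over the video."""
    j = np.mean([region_similarity(p, g) for p, g in zip(preds, gts)])
    f = np.mean([boundary_accuracy(p, g) for p, g in zip(preds, gts)])
    return (j + f) / 2
```

A fixed pixel tolerance like `tol=2` penalizes small objects disproportionately, which is one motivation the abstract gives for MOSEv2's scale-aware ranking metric.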