Autoregressive Universal Video Segmentation Model

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video segmentation methods perform well in prompted settings but lack a unified framework for prompt-free (fully automatic) detection and tracking. This work proposes the Autoregressive Universal Segmentation Model (AUSM), which introduces State Space Models (SSMs) into video segmentation, reformulating it as sequential mask prediction so that a single architecture handles both prompted and prompt-free scenarios. AUSM maintains a fixed-dimensional state, enabling processing of arbitrarily long video streams in constant memory while supporting parallel training across frames and sequential inference. On standard benchmarks, including DAVIS17 and YouTube-VOS, AUSM outperforms prior universal streaming video segmentation methods, and it trains up to 2.5× faster on 16-frame sequences than iterative training, improving both generality and computational efficiency.
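The streaming pattern the summary describes, a fixed-size state updated once per frame with a mask decoded at every step, can be sketched in a few lines. This is an illustration of the recurrence pattern only, not AUSM's actual architecture: the `A`/`B` matrices, the frame encoder, and the mask decoder below are all hypothetical stand-ins.

```python
import numpy as np

class StreamingSegmenter:
    """Toy sketch of autoregressive streaming segmentation: a fixed-size
    state is updated per frame (constant memory for any video length),
    and a mask is decoded from the state at every step.
    Not AUSM's architecture; the components here are placeholders."""

    def __init__(self, state_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.state = np.zeros(state_dim)  # fixed-size state, never grows
        # hypothetical state-transition and input-projection parameters
        self.A = rng.standard_normal((state_dim, state_dim)) * 0.1
        self.B = rng.standard_normal(state_dim) * 0.1

    def step(self, frame: np.ndarray) -> np.ndarray:
        """Consume one frame, update the state, emit a binary mask."""
        feat = frame.mean()  # stand-in for a per-frame feature encoder
        self.state = np.tanh(self.A @ self.state + self.B * feat)
        # decode a mask with the same spatial shape as the frame
        score = frame * self.state.mean()
        return (score > score.mean()).astype(np.uint8)

# Frames are consumed one at a time; memory stays constant as the stream grows.
seg = StreamingSegmenter(state_dim=8)
frames = np.random.default_rng(1).random((16, 4, 4))
masks = [seg.step(f) for f in frames]
```

The key property being illustrated is that `seg.state` has the same shape after frame 16 as after frame 1, which is what lets a streaming model scale to arbitrarily long videos.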

📝 Abstract
Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation that aims to detect and track all objects in a video without external cues, leaving today's landscape fragmented across task-specific models and pipelines. We recast streaming video segmentation as sequential mask prediction, analogous to language modeling, and introduce the Autoregressive Universal Segmentation Model (AUSM), a single architecture that unifies both prompted and unprompted video segmentation. Built on recent state-space models, AUSM maintains a fixed-size spatial state and scales to video streams of arbitrary length. Furthermore, all components of AUSM are designed for parallel training across frames, yielding substantial speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS 2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS) AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16-frame sequences.
Problem

Research questions and friction points this paper is trying to address.

Unifying prompted and unprompted video segmentation tasks
Detecting and tracking all objects without external cues
Replacing fragmented task-specific models with a single architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive sequential mask prediction model
Unified prompted and unprompted segmentation architecture
Parallel training design for faster processing
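The speedup claim above rests on a property of linear state-space recurrences: all per-frame states can be computed in one pass over the whole clip rather than frame by frame. A toy scalar version of that equivalence (AUSM's actual formulation is not reproduced here, and the recurrence below is a generic linear SSM, not the paper's):

```python
import numpy as np

# Linear recurrence: s_t = a * s_{t-1} + b_t.
# Sequential evaluation (inference-style) and closed-form evaluation
# over all steps at once (training-style) give identical states —
# the property that enables parallel training across frames.

a = 0.9
b = np.array([1.0, 2.0, 3.0, 4.0])  # toy per-frame inputs

# Sequential: one step per frame, as at inference time.
s_seq = []
s = 0.0
for bt in b:
    s = a * s + bt
    s_seq.append(s)

# Parallel: s_t = sum_{k <= t} a^(t-k) * b_k, computed for all t at once.
T = len(b)
powers = a ** (np.arange(T)[:, None] - np.arange(T)[None, :])  # a^(t-k)
causal = np.tril(np.ones((T, T)))  # keep only k <= t terms
s_par = (powers * causal) @ b
```

Both paths yield the same state trajectory, so training can use the parallel form across a 16-frame clip while inference runs the cheap sequential form frame by frame.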