A2VIS: Amodal-Aware Approach to Video Instance Segmentation

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In video instance segmentation (VIS) and multi-object tracking (MOT), frequent identity switches and severe segmentation fragmentation under occlusion stem from the difficulty of modeling the complete geometric structure of occluded regions. To address this, the paper proposes a unified VIS framework that jointly leverages modal perception (visible appearance) and amodal perception (complete-shape priors). The approach introduces spatiotemporally consistent global instance prototypes and a mask head that combines intra-clip visibility-guided feature extraction with inter-clip amodal feature aggregation, together with a spatiotemporal prototype fusion mechanism and a multi-clip interaction module for temporal coherence. By co-training visible and amodal representations, the method improves instance completeness and trajectory stability under occlusion. On the YouTube-VIS, OVIS, and BDD100K benchmarks it reports state-of-the-art performance on both VIS and MOT metrics, reducing ID switches by 23.6% and segmentation fragmentation by 19.4%.

๐Ÿ“ Abstract
Handling occlusion remains a significant challenge for video instance-level tasks like Multiple Object Tracking (MOT) and Video Instance Segmentation (VIS). In this paper, we propose a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to achieve a reliable and comprehensive understanding of both visible and occluded parts of objects in a video. The key intuition is that awareness of amodal segmentation across the spatiotemporal dimension enables a stable stream of object information. In scenarios where objects are partially or completely hidden from view, amodal segmentation offers more consistency and less dramatic changes along the temporal axis compared to visible segmentation. Hence, both amodal and visible information from all clips can be integrated into one global instance prototype. To effectively address the challenge of video amodal segmentation, we introduce the spatiotemporal-prior Amodal Mask Head, which leverages visible information within clips while extracting amodal characteristics across clips. Through extensive experiments and ablation studies, we show that A2VIS excels in both MOT and VIS tasks in identifying and tracking object instances with a keen understanding of their full shape.
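The abstract's idea of folding amodal and visible evidence from all clips into one global instance prototype can be sketched as a visibility-weighted fusion. This is a minimal illustrative stand-in, not A2VIS's actual code: the function name, the weighting rule, and the array shapes are all assumptions.

```python
import numpy as np

def fuse_global_prototype(clip_protos, visibility):
    """Fuse per-clip instance embeddings into one global prototype.

    clip_protos: (num_clips, C) embedding of the instance in each clip.
    visibility:  (num_clips,) fraction of the instance visible per clip.

    Clips where the object is more visible carry more reliable
    appearance evidence, so they receive larger weights. This
    weighted mean is a stand-in for the paper's fusion mechanism.
    """
    protos = np.asarray(clip_protos, dtype=float)
    w = np.asarray(visibility, dtype=float)
    w = w / w.sum()                      # normalize weights to sum to 1
    return (w[:, None] * protos).sum(axis=0)
```

For example, two clips with embeddings [1, 0] and [0, 1] and visibilities 1 and 3 fuse to [0.25, 0.75], so the less occluded clip dominates the prototype.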
Problem

Research questions and friction points this paper is trying to address.

Handling occlusion in video instance segmentation tasks
Integrating amodal and visible object parts for reliable understanding
Improving consistency in object tracking with spatiotemporal amodal segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Amodal-Aware Video Instance Segmentation framework
Spatiotemporal-prior Amodal Mask Head
Global instance prototype integration
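The contributions above can be illustrated with a toy mask head that correlates a global prototype with per-frame features within a clip and shares shape evidence across frames. The shapes, the correlation readout, and the max-over-time aggregation below are assumptions for illustration only; the paper's spatiotemporal-prior Amodal Mask Head is more elaborate.

```python
import numpy as np

def amodal_mask_logits(frame_feats, global_proto):
    """Toy amodal mask head.

    frame_feats:  (T, C, H, W) per-frame feature maps of a clip.
    global_proto: (C,) global instance prototype.

    Per-frame correlation with the prototype yields visible-biased
    logits; pooling the strongest response over time lets heavily
    occluded frames inherit a complete-shape estimate from frames
    where the object was actually seen.
    """
    # Intra-clip: correlate each pixel feature with the prototype.
    per_frame = np.einsum('tchw,c->thw', frame_feats, global_proto)
    # Inter-clip/temporal: share the peak response across frames.
    pooled = per_frame.max(axis=0, keepdims=True)
    return 0.5 * (per_frame + pooled)
```

In this sketch, a frame where the object is fully occluded (zero correlation) still receives half of the strongest response seen elsewhere in the clip, mimicking how amodal awareness stabilizes the mask stream under occlusion.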