ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark

📅 2025-12-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Low-light video instance segmentation (VIS) suffers significant performance degradation due to spatiotemporal degradations—including noise, motion blur, and low contrast—compounded by the scarcity of real-world annotated data and effective synthetic low-light video generation methods. To address this, we propose the first unsupervised low-light video synthesis framework that jointly models spatial and temporal degradations. We design a calibration-free Video Degradation Prior Network (VDP-Net) to learn degradation distributions directly from unlabeled videos, and introduce a decoupled enhancement decoder that explicitly separates content representations from degradation features. Our approach leverages self-supervised degradation modeling, spatiotemporal consistency constraints, and domain-adaptive training. Evaluated on synthetically generated low-light YouTube-VIS 2019, it achieves a +3.7 AP improvement over state-of-the-art VIS models and fine-tuning baselines. This work establishes a robust, scalable, end-to-end solution for low-light VIS.

Technology Category

Application Category

📝 Abstract
Video instance segmentation (VIS) for low-light content remains highly challenging for both humans and machines alike, due to adverse imaging conditions including noise, blur and low-contrast. The lack of large-scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low-light videos and, as a result, perform poorly even when finetuned on low-light data. In this paper, we introduce extbf{ELVIS} ( extbf{E}nhance extbf{L}ow-light for extbf{V}ideo extbf{I}nstance extbf{S}egmentation), a novel framework that enables effective domain adaptation of state-of-the-art VIS models to low-light scenarios. ELVIS comprises an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile synthesis network (VDP-Net) and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performances by up to extbf{+3.7AP} on the synthetic low-light YouTube-VIS 2019 dataset. Code will be released upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

Addresses low-light video instance segmentation challenges
Overcomes lack of large-scale annotated low-light datasets
Enhances robustness to degradations like noise and blur
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised synthetic pipeline modeling spatiotemporal degradations
Calibration-free degradation profile synthesis network VDP-Net
Enhancement decoder disentangling degradations from content features
🔎 Similar Papers
No similar papers found.