DTTNet: Improving Video Shadow Detection via Dark-Aware Guidance and Tokenized Temporal Modeling

📅 2025-11-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video shadow detection faces two key challenges: semantic ambiguity between shadows and dark objects against complex backgrounds, and the difficulty of modeling shadow deformation under dynamic lighting. To address these, we propose a language-guided spatiotemporal disentanglement framework. Our method introduces a vision-language matching module and a dark-region-aware block, strengthened by adaptive mask reweighting and edge-aware supervision, to improve discrimination between shadows and dark objects. Additionally, we incorporate learnable temporal tokens and a Tokenized Temporal Block (TTB) to enable efficient spatiotemporal disentanglement. The framework jointly leverages vision-language pretrained features, dark-aware semantic representations, and edge-mask supervision. Evaluated on multiple benchmarks, our approach achieves state-of-the-art performance while supporting real-time inference, with significant improvements in both detection accuracy and computational efficiency.

📝 Abstract
Video shadow detection confronts two entwined difficulties: distinguishing shadows from complex backgrounds and modeling dynamic shadow deformations under varying illumination. To address shadow-background ambiguity, we leverage linguistic priors through the proposed Vision-language Match Module (VMM) and a Dark-aware Semantic Block (DSB), extracting text-guided features to explicitly differentiate shadows from dark objects. Furthermore, we introduce adaptive mask reweighting to downweight penumbra regions during training and apply edge masks at the final decoder stage for better supervision. For temporal modeling of variable shadow shapes, we propose a Tokenized Temporal Block (TTB) that decouples spatiotemporal learning. TTB summarizes cross-frame shadow semantics into learnable temporal tokens, enabling efficient sequence encoding with minimal computational overhead. Comprehensive experiments on multiple benchmark datasets demonstrate state-of-the-art accuracy and real-time inference efficiency. Code is available at https://github.com/city-cheng/DTTNet.
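The abstract's description of the TTB — learnable temporal tokens that summarize cross-frame shadow semantics, which each frame then reads back — can be illustrated with a minimal numpy sketch. This is an assumption-laden illustration, not the paper's implementation: it uses single-head attention without learned projections, and the function and parameter names (`tokenized_temporal_block`, token count `K`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # Single-head scaled dot-product attention (no learned projections, for brevity).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def tokenized_temporal_block(frame_feats, tokens):
    """Hypothetical sketch of tokenized temporal modeling.

    frame_feats: (T, N, C) -- frames x spatial positions x channels
    tokens:      (K, C)    -- learnable temporal tokens, with K << T*N
    """
    T, N, C = frame_feats.shape
    seq = frame_feats.reshape(T * N, C)
    # Tokens gather cross-frame context: (K, C)
    summary = cross_attention(tokens, seq, seq)
    # Each space-time position reads the compact summary back: (T*N, C)
    update = cross_attention(seq, summary, summary)
    return (seq + update).reshape(T, N, C)
```

The point of routing through K tokens is cost: attention scales as O(T·N·K) rather than the O((T·N)²) of full space-time self-attention, which is one plausible reading of the abstract's claim of "minimal computational overhead".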
Problem

Research questions and friction points this paper is trying to address.

Distinguishing shadows from complex backgrounds using linguistic guidance
Modeling dynamic shadow deformations under varying illumination conditions
Addressing shadow-background ambiguity and temporal shape variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language matching for shadow-background differentiation
Adaptive mask reweighting to handle penumbra regions
Tokenized temporal modeling for efficient shadow sequence encoding
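The adaptive mask reweighting idea — downweighting penumbra regions during training — can be sketched as a per-pixel weighted BCE loss. The details below are assumptions for illustration only: the penumbra is approximated crudely as the band of pixels adjacent to a 0/1 transition in the ground-truth mask, and the weight value is arbitrary; the paper's reweighting scheme is adaptive and may differ substantially.

```python
import numpy as np

def boundary_band(mask):
    # Mark pixels adjacent to a 0/1 transition in a binary mask
    # (a crude stand-in for the penumbra region).
    band = np.zeros(mask.shape, dtype=bool)
    vert = mask[:-1, :] != mask[1:, :]
    horiz = mask[:, :-1] != mask[:, 1:]
    band[:-1, :] |= vert
    band[1:, :] |= vert
    band[:, :-1] |= horiz
    band[:, 1:] |= horiz
    return band

def reweighted_bce(pred, target, penumbra_weight=0.3, eps=1e-7):
    # Per-pixel BCE where boundary (penumbra-proxy) pixels get a
    # smaller weight, so ambiguous soft-shadow pixels dominate less.
    w = np.where(boundary_band(target), penumbra_weight, 1.0)
    p = np.clip(pred, eps, 1.0 - eps)
    bce = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return (w * bce).sum() / w.sum()
```

A fully adaptive variant might derive the weights from prediction uncertainty or shadow intensity rather than a fixed boundary band, but that choice is not specified by the text above.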
Zhicheng Li
School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology
Kunyang Sun
School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology
Rui Yao
School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology
Hancheng Zhu
School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology
Fuyuan Hu
Professor of Suzhou University of Science and Technology
Machine Learning · Computer Vision
Jiaqi Zhao
Xidian University
privacy-preserving machine learning
Zhiwen Shao
School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology
Yong Zhou
School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology