DTTNet: Improving Video Shadow Detection via Dark-Aware Guidance and Tokenized Temporal Modeling

📅 2025-11-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video shadow detection faces two key challenges: semantic ambiguity between shadows and dark objects against complex backgrounds, and the difficulty of modeling shadow deformation under dynamic lighting. To address these, we propose a language-guided spatiotemporal disentanglement framework. Our method introduces a vision-language matching module and a dark-region-aware block, strengthened by adaptive mask reweighting and edge-aware supervision, to improve discrimination between shadows and dark objects. Additionally, we incorporate learnable temporal tokens and a Tokenized Temporal Block (TTB) to enable efficient spatiotemporal disentanglement. The framework jointly leverages vision-language pretrained features, dark-aware semantic representations, and edge-mask supervision. Evaluated on multiple benchmarks, our approach achieves state-of-the-art performance while supporting real-time inference, with significant improvements in both detection accuracy and computational efficiency.

📝 Abstract
Video shadow detection confronts two entwined difficulties: distinguishing shadows from complex backgrounds and modeling dynamic shadow deformations under varying illumination. To address shadow-background ambiguity, we leverage linguistic priors through the proposed Vision-language Match Module (VMM) and a Dark-aware Semantic Block (DSB), extracting text-guided features to explicitly differentiate shadows from dark objects. Furthermore, we introduce adaptive mask reweighting to downweight penumbra regions during training and apply edge masks at the final decoder stage for better supervision. For temporal modeling of variable shadow shapes, we propose a Tokenized Temporal Block (TTB) that decouples spatiotemporal learning. TTB summarizes cross-frame shadow semantics into learnable temporal tokens, enabling efficient sequence encoding with minimal computational overhead. Comprehensive experiments on multiple benchmark datasets demonstrate state-of-the-art accuracy and real-time inference efficiency. Code is available at https://github.com/city-cheng/DTTNet.
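The abstract's description of the TTB — learnable temporal tokens that summarize cross-frame shadow semantics, which each frame then reads back — can be illustrated with a minimal numpy sketch. This is an assumption-laden illustration, not the paper's implementation: it uses single-head attention without learned projections, and the function and parameter names (`tokenized_temporal_block`, token count `K`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # Single-head scaled dot-product attention (no learned projections, for brevity).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def tokenized_temporal_block(frame_feats, tokens):
    """Hypothetical sketch of tokenized temporal modeling.

    frame_feats: (T, N, C) -- frames x spatial positions x channels
    tokens:      (K, C)    -- learnable temporal tokens, with K << T*N
    """
    T, N, C = frame_feats.shape
    seq = frame_feats.reshape(T * N, C)
    # Tokens gather cross-frame context: (K, C)
    summary = cross_attention(tokens, seq, seq)
    # Each space-time position reads the compact summary back: (T*N, C)
    update = cross_attention(seq, summary, summary)
    return (seq + update).reshape(T, N, C)
```

The point of routing through K tokens is cost: attention scales as O(T·N·K) rather than the O((T·N)²) of full space-time self-attention, which is one plausible reading of the abstract's claim of "minimal computational overhead".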
Problem

Research questions and friction points this paper is trying to address.

Distinguishing shadows from complex backgrounds using linguistic guidance
Modeling dynamic shadow deformations under varying illumination conditions
Addressing shadow-background ambiguity and temporal shape variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language matching for shadow-background differentiation
Adaptive mask reweighting to handle penumbra regions
Tokenized temporal modeling for efficient shadow sequence encoding
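The adaptive mask reweighting idea — downweighting penumbra regions during training — can be sketched as a per-pixel weighted BCE loss. The details below are assumptions for illustration only: the penumbra is approximated crudely as the band of pixels adjacent to a 0/1 transition in the ground-truth mask, and the weight value is arbitrary; the paper's reweighting scheme is adaptive and may differ substantially.

```python
import numpy as np

def boundary_band(mask):
    # Mark pixels adjacent to a 0/1 transition in a binary mask
    # (a crude stand-in for the penumbra region).
    band = np.zeros(mask.shape, dtype=bool)
    vert = mask[:-1, :] != mask[1:, :]
    horiz = mask[:, :-1] != mask[:, 1:]
    band[:-1, :] |= vert
    band[1:, :] |= vert
    band[:, :-1] |= horiz
    band[:, 1:] |= horiz
    return band

def reweighted_bce(pred, target, penumbra_weight=0.3, eps=1e-7):
    # Per-pixel BCE where boundary (penumbra-proxy) pixels get a
    # smaller weight, so ambiguous soft-shadow pixels dominate less.
    w = np.where(boundary_band(target), penumbra_weight, 1.0)
    p = np.clip(pred, eps, 1.0 - eps)
    bce = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return (w * bce).sum() / w.sum()
```

A fully adaptive variant might derive the weights from prediction uncertainty or shadow intensity rather than a fixed boundary band, but that choice is not specified by the text above.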
Zhicheng Li
School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology
Kunyang Sun
School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology
Rui Yao
School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology
Hancheng Zhu
School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology
Fuyuan Hu
Professor of Suzhou University of Science and Technology
Machine Learning · Computer Vision
Jiaqi Zhao
Xidian University
privacy-preserving machine learning
Zhiwen Shao
School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology
Yong Zhou
School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology