Revisiting Shadow Detection from a Vision-Language Perspective

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty of distinguishing shadows from intrinsically dark regions in visually ambiguous scenes, where pixel-level cues alone are insufficient. The authors propose SVL, a Shadow Vision-Language framework that brings vision-language alignment into shadow detection: a frozen DINOv3 image encoder extracts the global image representation, which is aligned with shadow-related text embeddings through a scene-level shadow ratio regression objective. A global-to-local coupling mechanism enforces consistency between this image-level guidance and patch-level predictions, while local patch-level constraints with text embeddings sharpen fine-grained discrimination. Only lightweight projection and decoding modules are trained, amounting to less than 1% trainable parameters. Experiments on multiple benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness on visually ambiguous samples.

📝 Abstract

Shadow detection is commonly formulated as a vision-driven dense prediction problem, where models rely primarily on pixel-wise visual supervision to distinguish shadows from non-shadow regions. However, this formulation can become unreliable in visually ambiguous cases, where similar dark regions may correspond either to cast shadows or to intrinsically dark surfaces, making visual evidence alone insufficient for establishing a stable decision rule. In this work, we revisit shadow detection from a vision-language perspective and argue that robust prediction benefits from an explicit semantic reference beyond visual cues alone. We propose SVL, a Shadow Vision-Language framework that uses language as an explicit semantic reference to disambiguate shadows from visually similar dark regions. SVL aligns the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, thereby providing image-level guidance on the overall extent of shadows. To transfer this global guidance to dense inference, SVL introduces a global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions. In parallel, SVL applies local patch-level constraints with text embeddings to improve fine-grained discrimination under challenging appearance conditions. Built on a frozen DINOv3 image encoder, the framework learns only lightweight projection and decoding modules, yielding a parameter-efficient design with less than 1% trainable parameters. Extensive experiments on multiple shadow detection benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness under visually ambiguous conditions.
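
The abstract names three training signals (a scene-level shadow ratio regression, a global-to-local consistency term, and local patch-level text constraints) without giving their exact form. The following is a minimal sketch of how such terms could be combined, not the authors' implementation. Everything specific in it is an assumption made for illustration: the two-prompt setup (one "shadow" and one "non-shadow" text embedding), the cosine-similarity softmax with a temperature, the squared-error and cross-entropy forms, and all function and parameter names. Embeddings are plain Python lists standing in for projected encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def shadow_prob(embedding, text_shadow, text_non_shadow, temperature=0.1):
    """Two-way softmax over similarities to the two text embeddings.

    Assumption: a CLIP-style temperature-scaled softmax; the paper may
    use a different alignment head.
    """
    s_pos = cosine(embedding, text_shadow) / temperature
    s_neg = cosine(embedding, text_non_shadow) / temperature
    # A two-class softmax equals a sigmoid of the logit difference.
    return 1.0 / (1.0 + math.exp(s_neg - s_pos))


def svl_style_losses(global_emb, patch_embs, patch_labels,
                     text_shadow, text_non_shadow):
    """Return the three hypothetical loss terms as a dict.

    patch_labels holds 1 for shadow patches and 0 otherwise.
    """
    eps = 1e-7
    n = len(patch_embs)

    # Global term: regress the scene-level shadow ratio.
    ratio_pred = shadow_prob(global_emb, text_shadow, text_non_shadow)
    ratio_true = sum(patch_labels) / n
    loss_ratio = (ratio_pred - ratio_true) ** 2

    # Local term: per-patch binary cross-entropy against the labels.
    patch_probs = [shadow_prob(p, text_shadow, text_non_shadow)
                   for p in patch_embs]
    loss_local = -sum(
        y * math.log(max(q, eps)) + (1 - y) * math.log(max(1.0 - q, eps))
        for q, y in zip(patch_probs, patch_labels)
    ) / n

    # Coupling term: mean patch prediction should match the global ratio.
    loss_couple = (sum(patch_probs) / n - ratio_pred) ** 2

    return {"ratio": loss_ratio, "local": loss_local, "couple": loss_couple}


# Toy usage with 2-D embeddings: one shadow patch and one non-shadow patch.
losses = svl_style_losses(
    global_emb=[1.0, 1.0],
    patch_embs=[[1.0, 0.0], [0.0, 1.0]],
    patch_labels=[1, 0],
    text_shadow=[1.0, 0.0],
    text_non_shadow=[0.0, 1.0],
)
```

In this toy setup the coupling term is what ties the dense prediction to the image-level estimate: if the patch probabilities drift toward labeling dark but unshadowed surfaces as shadow, their mean departs from the predicted ratio and the term grows. How the terms are weighted against each other is left open here because the abstract does not say.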

Problem

Research questions and friction points this paper is trying to address.

shadow detection

visual ambiguity

semantic reference

vision-language

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language alignment

shadow detection

global-to-local coupling

semantic disambiguation