🤖 AI Summary
This work addresses underwater image degradation, such as color distortion, low contrast, and poor visibility, caused by light absorption and scattering. Existing methods often generalize poorly because of rigid physical assumptions or insufficient training data. To overcome these limitations, the authors propose an enhancement framework that integrates Retinex theory with language-guided semantic priors. The framework comprises a prior-free illumination estimator, a cross-modal text alignment module, and a semantics-guided restorer, using CLIP-generated textual descriptions to provide high-level semantic guidance. The study pioneers the incorporation of textual semantics into underwater image enhancement, introduces LUIQD-TD, the first large-scale image-text underwater dataset, and designs an Image-Text Semantic Similarity (ITSS) loss. Experiments show that the method achieves state-of-the-art or comparable performance against 15 leading approaches across four public benchmarks and a newly curated dataset, improving both visual quality and semantic fidelity.
📝 Abstract
Underwater images often suffer from severe degradation caused by light absorption and scattering, leading to color distortion, low contrast, and reduced visibility. Existing Underwater Image Enhancement (UIE) methods fall into two categories: prior-based and learning-based. The former rely on rigid physical assumptions that limit adaptability, while the latter often face data scarcity and weak generalization. To address these issues, we propose a Physics-Semantics-Guided Underwater Image Enhancement Network (PSG-UIENet), which couples Retinex-grounded illumination correction with language-informed guidance. The network comprises a Prior-Free Illumination Estimator, a Cross-Modal Text Aligner, and a Semantics-Guided Image Restorer. In particular, the restorer leverages textual descriptions generated with the Contrastive Language-Image Pre-training (CLIP) model to inject high-level semantics as perceptually meaningful guidance. Since no multimodal UIE dataset is publicly available, we also construct a large-scale image-text UIE dataset, LUIQD-TD, which contains 6,418 image-reference-text triplets. To explicitly measure and optimize semantic consistency between textual descriptions and images, we further design an Image-Text Semantic Similarity (ITSS) loss function. To our knowledge, this study is the first to introduce both textual guidance and a multimodal dataset into the UIE task. Extensive experiments on our dataset and four publicly available datasets demonstrate that PSG-UIENet achieves superior or comparable performance against fifteen state-of-the-art methods.
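The abstract does not spell out the exact formulation of the ITSS loss. A minimal sketch of one plausible realization, assuming it penalizes the cosine distance between frozen-CLIP embeddings of the enhanced image and its paired caption (the checkpoint name, function boundaries, and preprocessing conventions below are illustrative assumptions, not taken from the paper):

```python
# Hypothetical ITSS-style loss sketch, NOT the paper's exact definition:
# 1 - cos( CLIP_img(enhanced image), CLIP_txt(paired caption) ).
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

CKPT = "openai/clip-vit-base-patch32"  # assumed checkpoint, for illustration
clip = CLIPModel.from_pretrained(CKPT).eval()  # CLIP stays frozen
processor = CLIPProcessor.from_pretrained(CKPT)

def itss_loss(enhanced: torch.Tensor, captions: list[str]) -> torch.Tensor:
    """Semantic-consistency loss between enhanced images and their captions.

    `enhanced`: batch of restored images, shape (B, 3, 224, 224), already
    resized and normalized with CLIP's mean/std so gradients can flow
    through the image branch back into the restorer.
    """
    text_inputs = processor(text=captions, return_tensors="pt", padding=True)
    with torch.no_grad():  # text embeddings need no gradients
        text_emb = clip.get_text_features(**text_inputs)
    image_emb = clip.get_image_features(pixel_values=enhanced)
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine distance, averaged over the batch.
    return (1.0 - (image_emb * text_emb).sum(dim=-1)).mean()
```

In training, such a term would typically be weighted and combined with the usual pixel-level and perceptual reconstruction losses; since CLIP is frozen, only the restorer receives gradients from it.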