🤖 AI Summary
This work addresses the limited ability of existing language-conditioned multimodal large models to recognize and restore degraded images in real-world image super-resolution. To overcome this, the authors propose a degradation-aware strategy that introduces a Real Embedding Extractor (REE) to improve recognition of degraded image content and a Conditional Feature Modulator (CFM) that injects the resulting high-level semantic information into a Mamba-based network for high-quality texture recovery. The method combines the Recognize Anything Model (RAM), contrastive learning, and degradation-aware embeddings, marking the first effort to integrate the Mamba architecture with conditional semantic guidance for real-world image super-resolution. Experiments demonstrate that the proposed approach achieves a superior balance between fidelity and perceptual quality, significantly improving visual reconstruction and validating the potential of Mamba in real-world super-resolution tasks.
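The degradation-aware contrastive learning used to train the REE can be sketched with a standard symmetric InfoNCE objective that pulls each degraded image's embedding toward the embedding of its clean counterpart. This is an illustrative sketch, not the paper's implementation: the function name, embedding dimensions, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(degraded_emb, clean_emb, temperature=0.07):
    """Symmetric InfoNCE: positives are matched (degraded, clean) pairs
    on the diagonal; all other images in the batch act as negatives."""
    d = F.normalize(degraded_emb, dim=-1)
    c = F.normalize(clean_emb, dim=-1)
    logits = d @ c.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(d.size(0))             # diagonal = positive pairs
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy batch: 4 degraded/clean pairs with 8-dim embeddings.
torch.manual_seed(0)
degraded = torch.randn(4, 8, requires_grad=True)  # stands in for REE output
clean = torch.randn(4, 8)                         # stands in for RAM output
loss = info_nce_loss(degraded, clean)
loss.backward()  # gradients flow back into the degraded-image branch
```

In practice the clean-image embeddings would come from a frozen RAM encoder, so only the degraded-image branch is updated.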
📝 Abstract
Multimodal large models have shown excellent ability in addressing image super-resolution in real-world scenarios by leveraging language-based class tags as conditional information, yet their ability to handle degraded images remains limited. In this paper, we first revisit the capabilities of the Recognize Anything Model (RAM) on degraded images by calculating text similarity. We find that directly fine-tuning RAM with contrastive learning in the degraded space struggles to achieve acceptable results. To address this issue, we employ a degradation selection strategy and propose a Real Embedding Extractor (REE), which achieves significant recognition gains on degraded image content through contrastive learning. Furthermore, we use a Conditional Feature Modulator (CFM) to inject the high-level information from the REE into a powerful Mamba-based network, which leverages effective pixel information to restore image textures and produce visually pleasing results. Extensive experiments demonstrate that the REE can effectively help image super-resolution networks balance fidelity and perceptual quality, highlighting the great potential of Mamba in real-world applications. The source code of this work will be made publicly available at: https://github.com/nathan66666/DACESR.git
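The abstract does not detail the CFM's internals, but a common way to inject a high-level embedding into a restoration backbone is FiLM-style per-channel scale-and-shift modulation. The sketch below illustrates that pattern under this assumption; the class name, dimensions, and single-linear-layer design are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConditionalFeatureModulator(nn.Module):
    """FiLM-style modulation: map a high-level embedding to per-channel
    scale and shift parameters, then apply them to the feature map."""
    def __init__(self, embed_dim, num_channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(embed_dim, 2 * num_channels)

    def forward(self, features, embedding):
        # features: (B, C, H, W); embedding: (B, embed_dim)
        scale, shift = self.to_scale_shift(embedding).chunk(2, dim=-1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)   # -> (B, C, 1, 1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        return features * (1 + scale) + shift       # shape preserved

# Toy usage: modulate a 32-channel feature map with a 16-dim embedding.
cfm = ConditionalFeatureModulator(embed_dim=16, num_channels=32)
feats = torch.randn(2, 32, 8, 8)
emb = torch.randn(2, 16)
out = cfm(feats, emb)
```

Because the output shape matches the input, such a module can be dropped between blocks of a Mamba-based (or any convolutional) backbone without altering the surrounding architecture.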