Adapting Large VLMs with Iterative and Manual Instructions for Generative Low-light Enhancement

📅 2025-07-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing low-light image enhancement (LLIE) methods primarily rely on low-light inputs or pre-trained model priors, failing to effectively leverage semantic information inherent in well-lit reference images, thereby limiting generalization under extreme low-light conditions. To address this, we propose VLM-IMI, the first LLIE framework integrating large vision-language models (VLMs) to enable semantics-guided restoration via dynamic cross-modal image-text feature alignment and iterative human-in-the-loop instruction refinement. Our key contributions include: (i) an instruction-prior fusion module that incorporates textual descriptions as learnable semantic cues; and (ii) a human-AI collaborative feedback mechanism that progressively improves semantic consistency and detail fidelity of restored outputs. Extensive experiments demonstrate that VLM-IMI achieves significant gains over state-of-the-art methods across multiple benchmarks, with marked improvements in structural integrity, semantic alignment, and texture recovery.

๐Ÿ“ Abstract
Most existing low-light image enhancement (LLIE) methods rely on pre-trained model priors, low-light inputs, or both, while neglecting the semantic guidance available from normal-light images. This limitation hinders their effectiveness in complex lighting conditions. In this paper, we propose VLM-IMI, a novel framework that leverages large vision-language models (VLMs) with iterative and manual instructions (IMIs) for LLIE. VLM-IMI incorporates textual descriptions of the desired normal-light content as enhancement cues, enabling semantically informed restoration. To effectively integrate cross-modal priors, we introduce an instruction prior fusion module, which dynamically aligns and fuses image and text features, promoting the generation of detailed and semantically coherent outputs. During inference, we adopt an iterative and manual instruction strategy to refine textual instructions, progressively improving visual quality. This refinement enhances structural fidelity, semantic alignment, and the recovery of fine details under extremely low-light conditions. Extensive experiments across diverse scenarios demonstrate that VLM-IMI outperforms state-of-the-art methods in both quantitative metrics and perceptual quality. The source code is available at https://github.com/sunxiaoran01/VLM-IMI.
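The instruction prior fusion module described above dynamically aligns and fuses image and text features. As a minimal, illustrative sketch of that cross-modal fusion idea, the snippet below implements single-head scaled dot-product cross-attention in which image tokens attend to text tokens, followed by a residual merge. All names, shapes, and the single-head simplification are assumptions for illustration; they are not the paper's actual architecture.

```python
import numpy as np

def fuse_instruction_prior(img_feats, txt_feats):
    """Illustrative cross-modal fusion: image tokens (queries) attend to
    text tokens (keys/values) via scaled dot-product attention, and the
    attended text context is merged back with a residual connection.

    img_feats: (N, d) array of image token features.
    txt_feats: (M, d) array of text (instruction) token features.
    Returns:   (N, d) fused image features.
    """
    d = img_feats.shape[-1]
    scores = img_feats @ txt_feats.T / np.sqrt(d)     # (N, M) similarity
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over text tokens
    attended = weights @ txt_feats                    # (N, d) text context
    return img_feats + attended                       # residual fusion

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 8))   # 16 image tokens, feature dim 8
txt = rng.standard_normal((4, 8))    # 4 instruction tokens, same dim
fused = fuse_instruction_prior(img, txt)
```

In a full system the queries, keys, and values would pass through learned projections and multiple heads; this sketch keeps only the alignment-and-fuse step that motivates the module.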
Problem

Research questions and friction points this paper is trying to address.

Enhancing low-light images using semantic guidance from normal-light images
Integrating cross-modal priors for detailed and coherent image restoration
Improving visual quality through iterative refinement of textual instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages large vision-language models for enhancement
Integrates textual descriptions as enhancement cues
Uses iterative manual instructions for refinement
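The iterative-and-manual-instruction strategy, enhance with the current instruction, assess the result, then refine the instruction, can be sketched as a simple control loop. Everything below (function names, the scalar stand-in for an image, the toy scoring) is a hypothetical illustration of the loop structure, not the paper's implementation.

```python
def iterative_refinement(image, enhance, score, propose_instruction,
                         max_rounds=3, target=0.9):
    """Enhance with the current instruction, score the output, and refine
    the instruction until the score reaches `target` or rounds run out."""
    instruction = propose_instruction(None)       # initial instruction
    output = enhance(image, instruction)
    for _ in range(max_rounds):
        if score(output) >= target:
            break
        instruction = propose_instruction(output)  # refine from current output
        output = enhance(image, instruction)
    return output, instruction

# Toy stand-ins: "image" is a brightness scalar; longer instructions
# brighten more. Purely illustrative.
def enhance(img, instr):
    return min(1.0, img + 0.2 * len(instr.split()))

def score(img):
    return img  # brighter is "better" in this toy

def propose(prev):
    return "brighten" if prev is None else "brighten the dark regions more"

out, final_instr = iterative_refinement(0.1, enhance, score, propose)
```

In VLM-IMI the refinement is manual (a human adjusts the textual instruction between rounds); the loop above only shows where that adjustment slots into inference.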