🤖 AI Summary
This work addresses low-light image enhancement (LLIE), where images suffer from noise, low contrast, and semantic ambiguity, so enhancement requires joint denoising and detail recovery. The paper proposes PixIE, a feed-forward pixel-space enhancement framework prompted by semantic features from DINOv3. A cross-scale denoising stage first suppresses noise while preserving structure; DINO-Prompted Pixel Blocks (DPPB) then inject intermediate DINOv3 features to refine details. Spatial-Channel Compaction (SCC) keeps the cost of pixel-attention bounded across scales, and Multi-Receptive-Field Pixel Embedding (MRPE) supplies neighborhood-aware pixel representations that are more robust to signal-dependent noise. Across LLIE benchmarks, the approach improves average PSNR by 1.9–15.0% and reduces LPIPS by 8.5–44.4% relative to recent state-of-the-art methods, with sharper details and more stable textures.
📝 Abstract
Low-light images exhibit severe noise, contrast loss, and semantic ambiguity, making enhancement a joint problem of denoising and detail recovery. We propose PixIE, a feed-forward pixel-space LLIE framework that is semantically prompted by a vision foundation model. PixIE first performs cross-scale denoising to suppress noise and preserve structure, then refines details with DINO-Prompted Pixel Blocks (DPPB) that inject intermediate DINOv3 features via patch-conditioned, spatially continuous per-pixel modulation. We introduce Spatial-Channel Compaction (SCC), which folds features into a compact spatial grid and compresses them along the channel dimension, so that pixel-attention is computed efficiently with bounded cost across scales. We further propose Multi-Receptive-Field Pixel Embedding (MRPE) to provide neighborhood-aware pixel representations before semantic prompting, improving robustness to signal-dependent noise compared with point-wise embeddings. Experiments on LLIE benchmarks show that PixIE improves average PSNR by 1.9–15.0% over recent state-of-the-art methods and reduces LPIPS by 8.5–44.4%. Qualitative comparisons further show that PixIE recovers sharper details and more stable textures, yielding better reconstruction fidelity and perceptual quality.
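The abstract describes SCC only at a high level, so the following is a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes that "folding" is a space-to-depth rearrangement onto a fixed grid and that channel compression is a per-location linear projection. The names `fold_to_grid`, `compress_channels`, and `scc_sketch`, the grid size, and the random projection weights (which would be learned in a real model) are all hypothetical.

```python
import numpy as np

def fold_to_grid(x, grid):
    """Space-to-depth fold (assumed): (C, H, W) -> (C * bh * bw, grid, grid)."""
    c, h, w = x.shape
    if h % grid or w % grid:
        raise ValueError("H and W must be divisible by grid in this sketch")
    bh, bw = h // grid, w // grid
    # Split each spatial axis into (grid cell, offset inside cell),
    # then move the within-cell offsets into the channel axis.
    x = x.reshape(c, grid, bh, grid, bw)
    x = x.transpose(0, 2, 4, 1, 3)
    return x.reshape(c * bh * bw, grid, grid)

def compress_channels(x, weight):
    """Per-location linear projection, i.e. a 1x1 convolution (assumed)."""
    return np.einsum("oc,chw->ohw", weight, x)

def scc_sketch(x, grid, out_channels, rng):
    folded = fold_to_grid(x, grid)
    # Random stand-in for a learned projection matrix.
    scale = np.sqrt(folded.shape[0])
    weight = rng.standard_normal((out_channels, folded.shape[0])) / scale
    return compress_channels(folded, weight)

rng = np.random.default_rng(0)
small = scc_sketch(rng.standard_normal((8, 64, 64)), 16, 32, rng)
large = scc_sketch(rng.standard_normal((8, 256, 256)), 16, 32, rng)
print(small.shape, large.shape)  # (32, 16, 16) (32, 16, 16)
```

Under these assumptions, both resolutions end up with the same 16 × 16 grid of 32-channel tokens. Attention over that grid then has a cost independent of the input size, which is one way to read the "bounded cost across scales" claim.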