You Only Landmark Once: Lightweight U-Net Face Super Resolution with YOLO-World Landmark Heatmaps

📅 2026-05-13

🤖 AI Summary
This work addresses the high computational cost of extreme 8× face super-resolution (reconstructing 128×128 images from 16×16 inputs), where existing methods often rely on complex architectures or adversarial training. The authors propose a lightweight U-Net framework that uses facial keypoint heatmaps generated by the open-vocabulary object detector YOLO-World as semantic guidance. A heatmap-weighted loss function emphasizes reconstruction fidelity in critical regions such as the eyes, nose, and mouth without training an auxiliary landmark network. The method needs no dedicated keypoint network or adversarial component, and it improves PSNR and SSIM on the CelebA dataset while producing sharper and more realistic face images, demonstrating the efficacy of detection priors in lightweight super-resolution.
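For readers unfamiliar with the PSNR metric mentioned in the summary, the following is a minimal sketch of its standard definition, 10·log10(MAX² / MSE). The function name `psnr`, the nested-list image representation, and the default intensity range of [0, 1] are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images.

    Images are nested lists of floats (an assumption for this sketch);
    max_val is the largest possible pixel value.
    """
    squared_error = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            squared_error += (p - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)
```

A higher value means the reconstruction is closer to the ground truth in the mean-squared-error sense; it does not by itself measure perceptual sharpness, which is why SSIM is usually reported alongside it.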
📝 Abstract
Face image super-resolution aims to recover high-resolution facial images from severely degraded inputs. Under extreme upscaling factors, fine facial details are often lost, making accurate reconstruction challenging. Existing methods typically rely on heavy network architectures, adversarial training schemes, or separate alignment networks, increasing model complexity and computational cost. To address these issues, we propose a lightweight U-Net-based architecture designed to reconstruct $128 \times 128$ facial images from severely degraded $16 \times 16$ inputs, achieving an $8\times$ magnification. A key contribution is a novel auxiliary-training-free supervision strategy that leverages heatmaps generated by YOLO-World, an open-vocabulary object detector, to localize key facial features such as eyes, nose, and mouth. These heatmaps are converted into spatial weights to form a heatmap-guided loss that emphasizes reconstruction errors in semantically important regions. Unlike prior methods that require dedicated landmark or alignment networks, our approach directly reuses detector outputs as supervision, maintaining an efficient training and inference pipeline. Experiments on the aligned CelebA dataset demonstrate that the proposed loss consistently improves quantitative metrics and produces sharper, more realistic reconstructions. Overall, our results show that lightweight networks can effectively exploit detection-driven priors for perceptually convincing extreme upscaling, without adversarial training or increased computational cost.
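The idea of converting heatmaps into spatial weights for a reconstruction loss can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the abstract does not give the weighting formula, so the weight `1 + alpha * heat`, the Gaussian rendering of landmark centers, the L1 error, and the names `gaussian_heatmap`, `heatmap_weighted_l1`, `sigma`, and `alpha` are all hypothetical. In practice the landmark centers would come from detector outputs; here they are passed in by hand.

```python
import math


def gaussian_heatmap(size, centers, sigma=4.0):
    """Render a size x size heatmap with one Gaussian bump per (x, y) center.

    Assumption: landmark locations are given as pixel coordinates, e.g.
    the centers of boxes returned by a detector.
    """
    heat = [[0.0] * size for _ in range(size)]
    for cx, cy in centers:
        for y in range(size):
            for x in range(size):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                bump = math.exp(-d2 / (2.0 * sigma ** 2))
                # Keep the strongest response where bumps overlap.
                heat[y][x] = max(heat[y][x], bump)
    return heat


def heatmap_weighted_l1(pred, target, heat, alpha=1.0):
    """Weighted mean absolute error with per-pixel weight 1 + alpha * heat.

    Assumption: this particular weighting is chosen for illustration only.
    With alpha = 0 it reduces to the plain mean absolute error.
    """
    total = 0.0
    weight_sum = 0.0
    for pred_row, target_row, heat_row in zip(pred, target, heat):
        for p, t, h in zip(pred_row, target_row, heat_row):
            w = 1.0 + alpha * h
            total += w * abs(p - t)
            weight_sum += w
    return total / weight_sum
```

Under this weighting, an error of the same magnitude costs more when it falls on a landmark region (eyes, nose, mouth) than when it falls on background, which is the behavior the abstract describes. Because the heatmaps are only needed to compute the loss, they add nothing to inference cost.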
Problem

Research questions and friction points this paper is trying to address.

face super-resolution
extreme upscaling
lightweight networks
facial detail recovery
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

face super-resolution
YOLO-World
heatmap-guided loss
lightweight U-Net
auxiliary-training-free