You Only Landmark Once: Lightweight U-Net Face Super Resolution with YOLO-World Landmark Heatmaps

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of existing methods for extreme 8× face super-resolution (reconstructing 128×128 images from 16×16 inputs), which often rely on complex architectures or adversarial training. The authors propose a lightweight U-Net framework that uses facial keypoint heatmaps generated by the open-vocabulary object detector YOLO-World as semantic guidance. A heatmap-weighted loss function emphasizes reconstruction fidelity in critical regions such as the eyes, nose, and mouth, and it does so without training any auxiliary landmark network. With no dedicated keypoint network and no adversarial components, the method improves PSNR and SSIM on the aligned CelebA dataset and produces sharper, more realistic face images, indicating that detection priors are effective for lightweight super-resolution.
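For readers unfamiliar with the PSNR metric mentioned above, the following is a minimal sketch of its standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, in plain Python. The function name `psnr` and the nested-list image representation are illustrative choices, not taken from the paper, and the paper's exact evaluation protocol (colour space, value range) is not stated here.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images.

    Images are nested lists of pixel intensities (an illustrative choice);
    max_value is the largest possible intensity, assumed 255 for 8-bit data.
    """
    squared_errors = [
        (r - e) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for r, e in zip(ref_row, est_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a reconstruction closer to the reference. Because PSNR averages the error uniformly over all pixels, it does not by itself favour facial regions such as the eyes or mouth.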

📝 Abstract

Face image super-resolution aims to recover high-resolution facial images from severely degraded inputs. Under extreme upscaling factors, fine facial details are often lost, making accurate reconstruction challenging. Existing methods typically rely on heavy network architectures, adversarial training schemes, or separate alignment networks, increasing model complexity and computational cost. To address these issues, we propose a lightweight U-Net-based architecture designed to reconstruct $128 \times 128$ facial images from severely degraded $16 \times 16$ inputs, achieving an $8\times$ magnification. A key contribution is a novel auxiliary-training-free supervision strategy that leverages heatmaps generated by YOLO-World, an open-vocabulary object detector, to localize key facial features such as eyes, nose, and mouth. These heatmaps are converted into spatial weights to form a heatmap-guided loss that emphasizes reconstruction errors in semantically important regions. Unlike prior methods that require dedicated landmark or alignment networks, our approach directly reuses detector outputs as supervision, maintaining an efficient training and inference pipeline. Experiments on the aligned CelebA dataset demonstrate that the proposed loss consistently improves quantitative metrics and produces sharper, more realistic reconstructions. Overall, our results show that lightweight networks can effectively exploit detection-driven priors for perceptually convincing extreme upscaling, without adversarial training or increased computational cost.
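The idea of converting heatmaps into spatial weights can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the Gaussian shape of the heatmap, the weight form $w = 1 + \alpha h$, the L1 error, the normalisation by the sum of weights, and the names `gaussian_heatmap`, `heatmap_weighted_l1`, `sigma`, and `alpha`. The landmark centres are hard-coded stand-ins for detector outputs; no YOLO-World call is made.

```python
import math


def gaussian_heatmap(size, centers, sigma=2.0):
    """Build a size x size heatmap with a Gaussian bump at each (x, y) centre.

    Assumption: detected facial features are reduced to centre points and
    rendered as Gaussians; overlapping bumps are combined with max().
    """
    heat = [[0.0] * size for _ in range(size)]
    for cx, cy in centers:
        for y in range(size):
            for x in range(size):
                dist_sq = (x - cx) ** 2 + (y - cy) ** 2
                bump = math.exp(-dist_sq / (2.0 * sigma ** 2))
                heat[y][x] = max(heat[y][x], bump)
    return heat


def heatmap_weighted_l1(sr, hr, heat, alpha=4.0):
    """Weighted mean absolute error with per-pixel weight 1 + alpha * heat.

    Pixels far from any landmark keep weight 1; pixels at a landmark centre
    get weight 1 + alpha (hypothetical weighting, chosen for illustration).
    """
    weighted_error = 0.0
    weight_total = 0.0
    for sr_row, hr_row, heat_row in zip(sr, hr, heat):
        for s, t, h in zip(sr_row, hr_row, heat_row):
            w = 1.0 + alpha * h
            weighted_error += w * abs(s - t)
            weight_total += w
    return weighted_error / weight_total


# Usage: the same unit error costs more near a landmark than far from it.
size = 8
heat = gaussian_heatmap(size, centers=[(2, 2)])
hr = [[0.0] * size for _ in range(size)]
sr_near = [row[:] for row in hr]
sr_near[2][2] = 1.0  # error at the landmark centre
sr_far = [row[:] for row in hr]
sr_far[7][7] = 1.0   # error in a background corner
loss_near = heatmap_weighted_l1(sr_near, hr, heat)
loss_far = heatmap_weighted_l1(sr_far, hr, heat)
```

Under these assumptions, `loss_near` exceeds `loss_far`, which is the behaviour the abstract describes: errors in semantically important regions are emphasised. Since the weights depend only on the heatmap, they can be computed once per training image and add no cost at inference.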

Problem

Research questions and friction points this paper is trying to address.

face super-resolution

extreme upscaling

lightweight networks

facial detail recovery

computational efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

face super-resolution

YOLO-World

heatmap-guided loss