🤖 AI Summary
Blind face restoration is inherently ill-posed, which often yields blurry outputs and little semantic control. This work proposes A2BFR, a framework that integrates attribute-aware supervision and text-guided control within a diffusion Transformer architecture. By combining unified image-text cross-modal attention, an attribute-aware encoder, and a semantic dual-training mechanism, the method achieves both high-fidelity reconstruction and instruction-driven controllability. The authors also introduce the AttrFace-90K dataset to enable fine-grained attribute manipulation. Experiments show that A2BFR reduces LPIPS by 0.0467 and improves attribute accuracy by 52.58% over diffusion-based BFR baselines, effectively balancing restoration quality with semantic controllability.
📝 Abstract
Blind face restoration (BFR) aims to recover high-quality facial images from degraded inputs, yet its inherently ill-posed nature leads to ambiguous and uncontrollable solutions. Recent diffusion-based BFR methods improve perceptual quality but remain uncontrollable, whereas text-guided face editing enables attribute manipulation without reliable restoration. To address these issues, we propose A$^2$BFR, an attribute-aware blind face restoration framework that unifies high-fidelity reconstruction with prompt-controllable generation. Built upon a Diffusion Transformer backbone with unified image-text cross-modal attention, A$^2$BFR jointly conditions the denoising trajectory on both degraded inputs and textual prompts. To inject semantic priors, we introduce attribute-aware learning, which supervises denoising latents using facial attribute embeddings extracted by an attribute-aware encoder. To further enhance prompt controllability, we introduce semantic dual-training, which leverages the pairwise attribute variations in our newly curated AttrFace-90K dataset to enforce attribute discrimination while preserving fidelity. Extensive experiments demonstrate that A$^2$BFR achieves state-of-the-art performance in both restoration fidelity and instruction adherence, outperforming diffusion-based BFR baselines by -0.0467 LPIPS and +52.58% attribute accuracy, while enabling fine-grained, prompt-controllable restoration even under severe degradations.
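The "unified image-text cross-modal attention" described above can be pictured as a single self-attention pass over the concatenation of noisy latent tokens, degraded-input tokens, and text-prompt tokens, so that the denoising trajectory is conditioned on both modalities at once. The sketch below is a minimal, single-head numpy illustration of that idea; the function name, token shapes, and projection matrices are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def unified_cross_modal_attention(noisy_tokens, degraded_tokens, text_tokens,
                                  Wq, Wk, Wv):
    """Single-head joint attention over noisy latents, degraded-input
    tokens, and text tokens (an illustrative sketch, not A^2BFR itself).

    All inputs are (num_tokens, dim) arrays sharing the same dim.
    """
    # one joint sequence: every noisy-latent token can attend to both
    # the degraded image and the textual prompt in the same pass
    seq = np.concatenate([noisy_tokens, degraded_tokens, text_tokens], axis=0)
    q, k, v = seq @ Wq, seq @ Wk, seq @ Wv
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))        # (N, N) attention weights
    out = attn @ v
    # only the noisy-latent positions feed the next denoising step
    return out[: noisy_tokens.shape[0]]
```

In a real Diffusion Transformer block this would be multi-head, with layer norms and residual connections, but the conditioning mechanism is the same: the image and text conditions enter through the shared attention sequence rather than through separate cross-attention branches.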