🤖 AI Summary
Clinical dermatological images are scarce, and existing synthetic methods suffer from low fidelity and insufficient lesion controllability. To address this, we propose LF-VAR, the first model that jointly encodes quantitative lesion measurement scores and categorical labels as language-style conditional embeddings. LF-VAR integrates a multi-scale lesion-focused VQ-VAE—ensuring high-fidelity latent representation—and a vision autoregressive Transformer—enabling sequential, token-wise generation. This design enables fine-grained control over lesion location, morphology, and type. Evaluated on seven common skin lesion classes, LF-VAR achieves a Fréchet Inception Distance (FID) of 0.74, outperforming the state-of-the-art by 6.3%. The generated images exhibit significantly improved photorealism and clinical relevance, demonstrating strong potential for data augmentation and training of downstream diagnostic models.
📝 Abstract
Skin images from real-world clinical practice are often limited, resulting in a shortage of training data for deep-learning models. While many studies have explored skin image synthesis, existing methods often generate low-quality images and lack control over the lesion's location and type. To address these limitations, we present LF-VAR, a model leveraging quantified lesion measurement scores and lesion type labels to guide the clinically relevant and controllable synthesis of skin images. It enables controlled skin synthesis with specific lesion characteristics based on language prompts. We train a multiscale lesion-focused Vector Quantised Variational Auto-Encoder (VQVAE) to encode images into discrete latent representations for structured tokenization. Then, a Visual AutoRegressive (VAR) Transformer trained on tokenized representations facilitates image synthesis. Lesion measurement from the lesion region and types as conditional embeddings are integrated to enhance synthesis fidelity. Our method achieves the best overall FID score (average 0.74) among seven lesion types, improving upon the previous state-of-the-art (SOTA) by 6.3%. The study highlights our controllable skin synthesis model's effectiveness in generating high-fidelity, clinically relevant synthetic skin images. Our framework code is available at https://github.com/echosun1996/LF-VAR.