Controllable Skin Synthesis via Lesion-Focused Vector Autoregression Model

📅 2025-08-27

📈 Citations: 0

✨ Influential: 0

career value

209K/year

🤖 AI Summary

Clinical dermatological images are scarce, and existing synthetic methods suffer from low fidelity and insufficient lesion controllability. To address this, we propose LF-VAR, the first model that jointly encodes quantitative lesion measurement scores and categorical labels as language-style conditional embeddings. LF-VAR integrates a multi-scale lesion-focused VQ-VAE—ensuring high-fidelity latent representation—and a vision autoregressive Transformer—enabling sequential, token-wise generation. This design enables fine-grained control over lesion location, morphology, and type. Evaluated on seven common skin lesion classes, LF-VAR achieves a Fréchet Inception Distance (FID) of 0.74, outperforming the state-of-the-art by 6.3%. The generated images exhibit significantly improved photorealism and clinical relevance, demonstrating strong potential for data augmentation and training of downstream diagnostic models.

Technology Category

Application Category

📝 Abstract

Skin images from real-world clinical practice are often limited, resulting in a shortage of training data for deep-learning models. While many studies have explored skin image synthesis, existing methods often generate low-quality images and lack control over the lesion's location and type. To address these limitations, we present LF-VAR, a model leveraging quantified lesion measurement scores and lesion type labels to guide the clinically relevant and controllable synthesis of skin images. It enables controlled skin synthesis with specific lesion characteristics based on language prompts. We train a multiscale lesion-focused Vector Quantised Variational Auto-Encoder (VQVAE) to encode images into discrete latent representations for structured tokenization. Then, a Visual AutoRegressive (VAR) Transformer trained on tokenized representations facilitates image synthesis. Lesion measurement from the lesion region and types as conditional embeddings are integrated to enhance synthesis fidelity. Our method achieves the best overall FID score (average 0.74) among seven lesion types, improving upon the previous state-of-the-art (SOTA) by 6.3%. The study highlights our controllable skin synthesis model's effectiveness in generating high-fidelity, clinically relevant synthetic skin images. Our framework code is available at https://github.com/echosun1996/LF-VAR.

Problem

Research questions and friction points this paper is trying to address.

Synthesizing high-quality skin images with controlled lesion characteristics

Addressing limited training data in clinical skin image datasets

Enhancing control over lesion location and type in synthetic images

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lesion-focused VQVAE for discrete image tokenization

Visual autoregressive transformer for guided synthesis

Conditional embeddings integrating lesion measurements and types

🔎 Similar Papers

No similar papers found.