🤖 AI Summary
This work addresses two problems in medical image synthesis: standard diffusion models tend to generate clinically implausible or hallucinated content, and no established metric measures how well a generated image aligns with pathology-relevant criteria. To this end, we propose Clinical Reward-Aligned Finetuning (CRAFT), a framework that uses multimodal large language models and vision-language models to transfer clinical knowledge into diffusion models via label-conditioned prompts, clinical checklists, and a differentiable reward mechanism. We introduce the Clinical Alignment Score (CAS) as a proxy metric that scores generated images along four dimensions of clinical relevance, and we use it as the optimization target. Experiments show that CRAFT consistently improves CAS and downstream classification performance across four medical imaging modalities and reduces low-alignment tail samples by 20.4% on average relative to the strongest baseline. These findings are corroborated by a blinded physician preference study, structured checklist audits, and memorization analysis.
📝 Abstract
Foundation diffusion models can generate photorealistic natural images, but adapting them to medical imaging remains challenging. In medical adaptation, limited labeled data can exacerbate hallucination-like and clinically implausible synthesis, while existing metrics such as FID or Inception Score do not quantify per-image alignment with pathology-relevant criteria. We introduce the Clinical Alignment Score (CAS), a foundation-model-based proxy for clinical alignment that evaluates generated images along four complementary dimensions beyond visual fidelity. Building on CAS, we propose Clinical Reward-Aligned Finetuning (CRAFT), a reward-based adaptation framework that transfers medical knowledge from multimodal large language models and vision-language models through label-conditioned prompt enrichment, clinical checklists, and differentiable reward optimization. Across four diverse modalities, CRAFT improves CAS and downstream classification performance over strong adaptation baselines. Beyond average CAS gains, CRAFT shrinks the empirical low-alignment tail, that is, the share of generated images scoring below a real-image reference threshold, by 5.5 to 34.7 percentage points relative to the strongest baseline, corresponding to a 20.4% average relative reduction across datasets. These results indicate fewer hallucination-like generations under CAS. They are corroborated by evaluation with out-of-family evaluators, structured checklist auditing, memorization analysis, and a blinded physician preference study on CheXpert.
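The low-alignment tail statistic described above combines two quantities that are easy to confuse: an absolute drop in percentage points and a relative reduction. The sketch below illustrates the arithmetic only. It assumes the reference threshold is a low quantile of real-image scores (the abstract does not say how the threshold is chosen), and the function names and all scores are hypothetical toy values, not results from the paper.

```python
import math

def reference_threshold(real_scores, q=0.1):
    """Nearest-rank q-quantile of real-image scores.

    Assumption: the abstract does not specify how the real-image
    reference threshold is derived; a low quantile is one plausible choice.
    """
    ordered = sorted(real_scores)
    rank = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[rank]

def tail_fraction(scores, threshold):
    """Fraction of samples scoring strictly below the threshold."""
    return sum(s < threshold for s in scores) / len(scores)

def relative_reduction(baseline_tail, method_tail):
    """Relative drop in tail mass versus the baseline (0.5 means half as many)."""
    if baseline_tail == 0:
        return 0.0
    return (baseline_tail - method_tail) / baseline_tail

# Toy alignment scores, invented for illustration only.
real_scores = [0.67, 0.70, 0.75, 0.81, 0.88, 0.90, 0.93, 0.95]
baseline_scores = [0.40, 0.55, 0.58, 0.66, 0.72, 0.80, 0.85, 0.91]
method_scores = [0.52, 0.63, 0.68, 0.74, 0.79, 0.84, 0.89, 0.94]

threshold = reference_threshold(real_scores)
baseline_tail = tail_fraction(baseline_scores, threshold)
method_tail = tail_fraction(method_scores, threshold)

# Absolute change in percentage points versus relative change.
point_drop = 100 * (baseline_tail - method_tail)
rel_drop = relative_reduction(baseline_tail, method_tail)

print(f"threshold={threshold:.2f}")
print(f"baseline tail={baseline_tail:.3f}, method tail={method_tail:.3f}")
print(f"drop={point_drop:.1f} points, relative reduction={rel_drop:.1%}")
```

With these toy numbers the tail falls from 50% to 25% of samples, which is a 25 percentage point drop and a 50% relative reduction. The same distinction applies when reading the per-dataset range and the averaged relative figure reported in the abstract.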