🤖 AI Summary
Surgical image data acquisition is severely constrained by high annotation costs and ethical restrictions, making high-fidelity synthetic images a necessary alternative. To address this, we propose an end-to-end text-to-medical-image generation framework tailored to surgical scenarios. Our method introduces the first fine-grained surgical text–image alignment paradigm, incorporating an anatomical-structure constraint loss and temporal modeling of the surgical workflow to enhance clinical plausibility. Built on diffusion models, it integrates a CLIP encoder fine-tuned on the surgical domain, anatomy-aware segmentation guidance, and procedure-specific keyword-enhanced attention. Evaluated on a multi-center surgical report dataset, our approach achieves a Fréchet Inception Distance (FID) of 14.3 and an 89.7% pass rate in a physician-blinded clinical validity assessment, significantly outperforming existing medical text-to-image methods. This work establishes a robust foundation for preoperative planning, surgical education, and AI-assisted annotation.
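For readers unfamiliar with the headline metric: FID measures the Fréchet distance between two Gaussians fitted to feature activations (typically from an Inception network) of real versus generated images, so lower is better. The sketch below is not the paper's evaluation code; it is a minimal illustration of the standard FID formula, assuming feature vectors for both image sets have already been extracted.

```python
import numpy as np
from scipy.linalg import sqrtm  # matrix square root for the covariance term

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*(S1 S2)^{1/2})."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        # Numerical noise can introduce tiny imaginary components; drop them.
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def fid_from_features(real_feats, fake_feats):
    """Fit a Gaussian to each feature set (rows = images) and compare them."""
    mu_r, sig_r = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
    mu_f, sig_f = fake_feats.mean(axis=0), np.cov(fake_feats, rowvar=False)
    return frechet_distance(mu_r, sig_r, mu_f, sig_f)
```

Identical feature sets yield an FID near zero, and the score grows as the generated distribution drifts from the real one; the reported 14.3 would be computed this way over Inception-style features of real and synthesized surgical images.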