SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing unified medical models: they treat multimodal understanding and generation as disjoint objectives, so understanding training does little to improve medical image synthesis. The authors propose a "generation-aligned understanding" principle and a unified framework, SynerMedGen, that links understanding objectives to generation tasks through task alignment. The framework introduces three understanding tasks designed to benefit generation and a two-stage training strategy that transfers the representations learned during understanding training to image synthesis. The authors also release SynerMed, a large-scale dataset of 1M paired synthesis samples and 2M generation-derived understanding instances. With understanding training alone, SynerMedGen achieves strong zero-shot performance across 22 medical image synthesis tasks and generalizes to unseen datasets; with generation training added, it consistently outperforms state-of-the-art specialized medical image synthesis models and recent unified medical models.

📝 Abstract

Unifying multimodal understanding and generation is a compelling frontier that is beginning to emerge in the medical field. However, the few existing unified medical models typically treat understanding and generation as disjoint objectives, lacking a meaningful functional synergy. In this work, we identify and address a critical question in unified medical modeling: what form of understanding truly benefits generation? We present SynerMedGen, a unified framework built on the proposed principle of generation-aligned understanding, which synergizes understanding objectives with generation tasks via task alignment. SynerMedGen introduces three generation-aligned understanding tasks and a two-stage training strategy that transfers generation-beneficial representations learned during understanding training to medical image synthesis. Remarkably, even with understanding training alone, SynerMedGen achieves strong zero-shot performance across 22 medical image synthesis tasks and demonstrates robust generalization to unseen datasets. When combined with generation training, SynerMedGen consistently outperforms state-of-the-art specialized medical image synthesis models as well as recent unified medical models. We also release a large-scale dataset named SynerMed consisting of 1M paired synthesis samples and 2M generation-derived understanding instances to support further research on understanding-generation synergy. Our project can be accessed at https://github.com/Mhilab/SynerMedGen.

Problem

Research questions and friction points this paper is trying to address.

medical multimodal understanding

generation

task alignment

unified modeling

synergy

Innovation

Methods, ideas, or system contributions that make the work stand out.

generation-aligned understanding

task alignment

unified medical modeling