AI Summary
Current vision-language pretraining (VLP) methods emphasize discriminative understanding but lack text-driven medical image generation capabilities, limiting the completeness and practicality of multimodal modeling. To address this, we propose the first generative VLP framework for medical imaging, introducing a discrete cross-modal representation learning paradigm based on vector-quantized variational autoencoders (VQ-VAEs) that unifies understanding and generation. Our method jointly optimizes four objectives: image-text contrastive alignment, cross-modal matching, text-to-image generation, and image-to-text generation. Evaluated on medical image retrieval, zero-shot classification, radiology report generation, and text-conditioned image synthesis, our approach achieves state-of-the-art performance across all tasks. It significantly improves generation quality, semantic fidelity, and cross-task generalization. This work establishes a new generative paradigm for medical multimodal intelligence, enabling both interpretable reasoning and controllable synthesis.
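As a rough illustration of how the four objectives might be combined during pretraining, here is a minimal PyTorch sketch of a joint loss. The function name `joint_vlp_loss`, the equal loss weighting, the symmetric InfoNCE form of the contrastive term, and the tensor arguments are all illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def joint_vlp_loss(img_emb, txt_emb, match_logits, match_labels,
                   t2i_logits, image_tokens, i2t_logits, text_tokens,
                   temperature=0.07):
    """Sketch: sum the four pretraining objectives into one scalar loss."""
    # 1) Image-text contrastive alignment: symmetric InfoNCE over the batch.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_itc = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

    # 2) Cross-modal matching: classify paired vs. unpaired image-text inputs.
    loss_itm = F.cross_entropy(match_logits, match_labels)

    # 3) Text-to-image generation: predict discrete VQ image token ids.
    loss_t2i = F.cross_entropy(t2i_logits.flatten(0, 1), image_tokens.flatten())

    # 4) Image-to-text generation: language-modeling loss on report tokens.
    loss_i2t = F.cross_entropy(i2t_logits.flatten(0, 1), text_tokens.flatten())

    # Equal weighting here is an assumption; real systems often tune weights.
    return loss_itc + loss_itm + loss_t2i + loss_i2t
```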
Abstract
Despite significant progress in Vision-Language Pre-training (VLP), current approaches predominantly emphasize feature extraction and cross-modal comprehension, with limited attention to generating or transforming visual content. This gap hinders a model's ability to synthesize coherent and novel visual representations from textual prompts, thereby reducing the effectiveness of multi-modal learning. In this work, we propose MedUnifier, a unified VLP framework tailored for medical data. MedUnifier seamlessly integrates text-grounded image generation capabilities with multi-modal learning strategies, including image-text contrastive alignment, image-text matching, and image-grounded text generation. Unlike traditional methods that rely on continuous visual representations, our approach employs visual vector quantization, which not only facilitates a more cohesive learning strategy for cross-modal understanding but also enhances multi-modal generation quality by effectively leveraging discrete representations. Our framework's effectiveness is demonstrated by experiments on established benchmarks, spanning uni-modal tasks (supervised fine-tuning), cross-modal tasks (image-text retrieval and zero-shot image classification), and multi-modal tasks (medical report generation and image synthesis), where it achieves state-of-the-art performance. MedUnifier also offers a highly adaptable tool for a wide range of language and vision tasks in healthcare, marking an advance toward a generalizable AI model for medical applications.
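To make the discrete visual representation concrete, below is a minimal sketch of a standard VQ-VAE quantization layer with a straight-through gradient estimator (van den Oord et al., 2017). The class name, codebook size, and commitment weight are illustrative assumptions; the paper's actual visual tokenizer may differ in architecture and hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: map continuous features to nearest codebook entries."""
    def __init__(self, num_codes=1024, code_dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z):  # z: (batch, n_patches, code_dim)
        # Squared distance from each feature to every codebook vector.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        indices = d.argmin(-1)        # discrete image token ids
        z_q = self.codebook(indices)  # quantized features
        # Codebook loss pulls codes toward features; commitment term does the reverse.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: copy gradients from z_q back to z.
        z_q = z + (z_q - z).detach()
        return z_q, indices, loss
```

The discrete `indices` are what a text-to-image head can predict with a standard cross-entropy objective, which is what lets one generative framework treat image synthesis and report generation symmetrically.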