🤖 AI Summary
This work addresses the challenge of jointly optimizing semantic understanding and high-fidelity generation in chest X-ray images, which is hindered by the inherent conflict between abstract semantic representation and pixel-level reconstruction. To resolve this, we propose the first decoupled architecture that employs an autoregressive branch for understanding tasks and a diffusion branch for generation, dynamically guiding the generative process through cross-modal self-attention using features from the understanding branch. This approach breaks away from conventional parameter-sharing paradigms and, despite using only one-quarter of the parameters of LLM-CXR, achieves significant improvements: a 46.1% gain in Micro-F1 on understanding tasks and a 24.2% enhancement in generation quality as measured by the FD-RadDino metric across two benchmarks. Our method establishes a new paradigm for synergistic medical image understanding and synthesis.
📝 Abstract
Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing approaches, typically based on parameter-shared autoregressive architectures, frequently lead to compromised performance in one or both tasks. To address this, we present UniX, a next-generation unified medical foundation model for chest X-ray understanding and generation. UniX decouples the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. Crucially, a cross-modal self-attention mechanism is introduced to dynamically guide the generation process with understanding features. Coupled with a rigorous data cleaning pipeline and a multi-stage training strategy, this architecture enables synergistic collaboration between tasks while leveraging the strengths of diffusion models for superior generation. On two representative benchmarks, UniX achieves a 46.1% improvement in understanding performance (Micro-F1) and a 24.2% gain in generation quality (FD-RadDino), using only a quarter of the parameters of LLM-CXR. By achieving performance on par with task-specific models, our work establishes a scalable paradigm for synergistic medical image understanding and generation. Code and models are available at https://github.com/ZrH42/UniX.
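The core idea of guiding generation with understanding features can be sketched as a cross-attention step in which the diffusion branch's latents act as queries and the understanding branch's features supply keys and values. The sketch below is a minimal single-head illustration; the function names, shapes, and projection matrices are assumptions for clarity, not the paper's exact implementation.

```python
import numpy as np

# Illustrative sketch (not the paper's code): generation latents
# (queries) attend to understanding-branch features (keys/values).
# Single head, no masking; all shapes and weights are hypothetical.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(gen_tokens, und_tokens, w_q, w_k, w_v):
    """gen_tokens: (n_gen, d); und_tokens: (n_und, d) -> (n_gen, d)."""
    q = gen_tokens @ w_q                      # queries from the diffusion branch
    k = und_tokens @ w_k                      # keys from the understanding branch
    v = und_tokens @ w_v                      # values from the understanding branch
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product attention
    return softmax(scores) @ v                # understanding-guided features

rng = np.random.default_rng(0)
d, n_gen, n_und = 16, 8, 12
gen = rng.standard_normal((n_gen, d))
und = rng.standard_normal((n_und, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = cross_modal_attention(gen, und, w_q, w_k, w_v)
```

In this reading, each generation token is updated with a convex combination of understanding-branch values, which is one plausible way semantic features could steer pixel-level synthesis without sharing parameters between the two branches.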