AI Summary
To address the insufficient modeling of dynamic disease progression across sequential chest X-ray studies in radiology report generation (RRG), this paper proposes a temporal-aware multimodal large language model (MLLM) framework. Methodologically, it pairs a radiology-specific image encoder with a Temporal Alignment Connector (TAC) to enable fine-grained difference modeling and cross-temporal semantic alignment between the current and prior studies. An end-to-end joint training strategy fully exploits the clinical reasoning cues embedded in multi-temporal, multimodal data. Evaluated on the MIMIC-CXR dataset, the approach achieves significant improvements in clinical relevance (+3.2%) and lexical accuracy (+2.8%), establishing new state-of-the-art performance among models of comparable scale. To the best of our knowledge, this is the first work to realize time-sensitive vision-language co-modeling explicitly tailored to radiological tasks.
Abstract
Radiology report generation (RRG) requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. While multimodal large language models (MLLMs) that incorporate pre-trained vision encoders have enhanced visual-language understanding, most existing methods rely on single-image analysis or rule-based heuristics to process multiple images, failing to fully leverage the temporal information in multimodal medical datasets. In this paper, we introduce Libra, a temporal-aware MLLM tailored for chest X-ray report generation. Libra combines a radiology-specific image encoder with a novel Temporal Alignment Connector (TAC), designed to accurately capture and integrate temporal differences between paired current and prior images. Extensive experiments on the MIMIC-CXR dataset demonstrate that Libra establishes a new state of the art among similarly scaled MLLMs, setting new standards in both clinical relevance and lexical accuracy.
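For intuition only, below is a minimal PyTorch sketch of what a TAC-style connector could look like: it projects current and prior image features into a shared space, models their fine-grained difference, and fuses the result into visual tokens for the language model. The class, layer names, dimensions, and fusion scheme are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TemporalAlignmentConnectorSketch(nn.Module):
    """Hypothetical TAC-style connector (illustrative, not the paper's design).

    Aligns current and prior study features, models their temporal
    difference, and fuses both into visual tokens for an LLM.
    """

    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj_cur = nn.Linear(vis_dim, llm_dim)    # project current-study features
        self.proj_prior = nn.Linear(vis_dim, llm_dim)  # project prior-study features
        # fuse [current, temporal difference] into a single token stream
        self.fuse = nn.Linear(2 * llm_dim, llm_dim)

    def forward(self, cur_feats: torch.Tensor, prior_feats: torch.Tensor) -> torch.Tensor:
        cur = self.proj_cur(cur_feats)        # (B, N, llm_dim)
        prior = self.proj_prior(prior_feats)  # (B, N, llm_dim)
        diff = cur - prior                    # crude stand-in for difference modeling
        return self.fuse(torch.cat([cur, diff], dim=-1))  # (B, N, llm_dim)

# Usage with dummy patch features from a (frozen) radiology image encoder:
cur = torch.randn(1, 196, 768)    # current study: 196 patch tokens, dim 768
prior = torch.randn(1, 196, 768)  # prior study, same layout
tac = TemporalAlignmentConnectorSketch(vis_dim=768, llm_dim=1024)
tokens = tac(cur, prior)
print(tokens.shape)  # torch.Size([1, 196, 1024]) -> visual tokens for the LLM
```

A simple subtraction is the weakest plausible difference model; the point of the sketch is only the interface, where paired current/prior features enter and a single temporally informed token stream exits for end-to-end training with the language model.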