🤖 AI Summary
This study addresses two limitations of existing medical image–language datasets that constrain vision–language models: they are small, and they are biased toward negative findings, because clinical reports often omit positive or neutral observations deemed irrelevant to the patient's condition. The authors propose a self-supervised data enrichment method based on semantic clustering: report sentences are clustered without supervision, and positive and neutral descriptions drawn from different clusters are added to the training reports. The cluster information is also incorporated into the reward design for GRPO-based reinforcement learning. In supervised fine-tuning, the method yields average gains of +5.63% in COMET, +3.04% in BERTScore, +7.40% in Sentence BLEU, +5.30% in CheXbert-F1, and +7.47% in RadGraph-F1; the cluster-aware GRPO reward adds further average gains of +2.78% in COMET, +3.14% in BERTScore, and +12.80% in Sentence BLEU.
📝 Abstract
Medical vision-language datasets are often limited in size and biased toward negative findings, as clinicians mostly report abnormalities and may omit positive or neutral findings that they consider irrelevant to the patient's condition. We propose a self-supervised data enrichment method that leverages semantic clustering of report sentences. We then enrich the findings in the training-set medical reports by adding positive/neutral observations from different clusters in a self-supervised manner. Our approach yields consistent gains in supervised fine-tuning (average gains of 5.63%, 3.04%, 7.40%, 5.30%, and 7.47% on COMET, BERTScore, Sentence BLEU, CheXbert-F1, and RadGraph-F1, respectively). Ablation studies confirm that the improvements stem from semantic clustering rather than from random augmentation. Furthermore, we introduce a way to incorporate semantic cluster information into the reward design for GRPO training, which leads to further performance gains (average gains of 2.78%, 3.14%, and 12.80% on COMET, BERTScore, and Sentence BLEU, respectively). We share our code at https://anonymous.4open.science/r/SemEnrich-75CF
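The abstract does not spell out the enrichment rule or the reward formula, so the following is only a minimal sketch of how the two ideas could look, not the authors' implementation. It assumes each report sentence has already been assigned a cluster id (for example by clustering sentence embeddings), that a pool of positive/neutral sentences exists per cluster, and that the reward is an F1 overlap between the cluster sets of the generated and reference reports. The names `CLUSTER_POOL`, `enrich_report`, and `cluster_f1_reward` are hypothetical.

```python
import random
from typing import Dict, List, Set

# Hypothetical toy pool: cluster id -> positive/neutral sentences in that cluster.
# In a real pipeline these clusters would come from clustering sentence embeddings.
CLUSTER_POOL: Dict[int, List[str]] = {
    0: ["The heart size is normal."],
    1: ["No pleural effusion is seen."],
    2: ["The osseous structures are intact."],
}


def enrich_report(
    sentences: List[str],
    sentence_cluster: Dict[str, int],
    pool: Dict[int, List[str]],
    rng: random.Random,
) -> List[str]:
    """Append one pooled sentence for every cluster the report does not cover.

    Assumption: a cluster already mentioned in the report is left alone, so a
    pooled "normal" sentence never contradicts a reported abnormality.
    """
    covered = {sentence_cluster[s] for s in sentences if s in sentence_cluster}
    enriched = list(sentences)
    for cluster_id in sorted(pool):
        if cluster_id not in covered:
            enriched.append(rng.choice(pool[cluster_id]))
    return enriched


def cluster_f1_reward(generated: Set[int], reference: Set[int]) -> float:
    """Assumed reward: F1 overlap between generated and reference cluster ids."""
    overlap = len(generated & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(generated)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


report = ["There is a large left pleural effusion."]
assignment = {"There is a large left pleural effusion.": 1}
enriched_report = enrich_report(report, assignment, CLUSTER_POOL, random.Random(0))
reward = cluster_f1_reward({0, 1}, {1, 2})
```

In this toy case the report covers only cluster 1, so one sentence each from clusters 0 and 2 is appended. A reward of this shape could be passed to a GRPO trainer as one scalar per sampled report, possibly combined with text-similarity terms; how the authors actually weight or combine it is not stated in the abstract.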