🤖 AI Summary
Preference fine-tuning for chest X-ray report generation is hindered by the prohibitive cost of collecting radiologist feedback at scale.
Method: We propose an automated preference-feedback pipeline that requires no additional radiologist annotation. Preference signals are constructed from publicly available pairs of images and radiologist-written reference reports, using reference-based metrics as Judges (e.g., GREEN). We also investigate reward overoptimization via length exploitation and introduce a length-controlled version of the GREEN score to mitigate it.
Contribution/Results: On MIMIC-CXR, the best-performing setup achieves state-of-the-art CheXbert scores on the report generation task while, on average, maintaining robust performance across six additional image-perception and reasoning tasks. This work offers a scalable route to preference optimization in radiology that relies on existing reference reports rather than scarce new expert feedback.
📝 Abstract
Radiologists play a crucial role in translating medical images into actionable reports. However, the field faces staffing shortages and increasing workloads. While automated approaches using vision-language models (VLMs) show promise as assistants, they require exceptionally high accuracy. Most current VLMs in radiology rely solely on supervised fine-tuning. Meanwhile, additional preference fine-tuning in the post-training pipeline has become standard practice in the general domain. The challenge in radiology lies in the prohibitive cost of obtaining radiologist feedback at scale. To address this challenge, we propose an automated pipeline for preference feedback, focusing on chest X-ray radiology report generation (RRG). Specifically, our method combines publicly available datasets of images paired with radiologist-written reference reports with reference-based metrics, or Judges, eliminating the need for additional radiologist feedback. We investigate reward overoptimization via length exploitation in this setting and introduce a length-controlled version of the GREEN score. Our best-performing setup achieves state-of-the-art CheXbert scores on the MIMIC-CXR dataset for the RRG task while, on average, maintaining robust performance across six additional image perception and reasoning tasks.
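The idea of turning a reference-based Judge into preference feedback can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify how candidate reports are sampled, how pairs are selected, or how the length control on GREEN is defined. In particular, `token_overlap_judge` is a toy stand-in rather than GREEN or CheXbert, and `length_controlled`, `max_ratio`, and `min_margin` are hypothetical names and parameters.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Sequence

Judge = Callable[[str, str], float]  # (candidate report, reference report) -> score


@dataclass
class PreferencePair:
    chosen: str
    rejected: str
    margin: float


def token_overlap_judge(candidate: str, reference: str) -> float:
    """Toy stand-in for a reference-based Judge (NOT GREEN or CheXbert).

    Returns the F1 overlap between the sets of lowercased tokens.
    """
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def length_controlled(judge: Judge, max_ratio: float = 1.5) -> Judge:
    """Hypothetical length control: scale the score down once the candidate
    is more than max_ratio times as long as the reference (in tokens)."""

    def scored(candidate: str, reference: str) -> float:
        raw = judge(candidate, reference)
        ref_len = max(len(reference.split()), 1)
        ratio = len(candidate.split()) / ref_len
        if ratio <= max_ratio:
            return raw
        return raw * max_ratio / ratio

    return scored


def build_pairs(
    candidates: Sequence[str],
    reference: str,
    judge: Judge,
    min_margin: float = 0.05,
) -> List[PreferencePair]:
    """Rank candidate reports against the reference and emit
    (chosen, rejected) pairs whose score gap is at least min_margin."""
    scored = [(judge(c, reference), c) for c in candidates]
    pairs = []
    for (score_a, a), (score_b, b) in combinations(scored, 2):
        margin = abs(score_a - score_b)
        if margin < min_margin:
            continue  # too close to call; skip ambiguous pairs
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append(PreferencePair(chosen, rejected, margin))
    return pairs
```

In this sketch, a report padded with extra tokens keeps most of its raw overlap score but loses it once wrapped with `length_controlled`, which is the kind of length exploitation the abstract refers to. The resulting pairs have the chosen/rejected shape that common preference-optimization methods consume; which method the authors actually use is not stated here.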