🤖 AI Summary
Medical vision-language models often hallucinate when generating chest X-ray reports, fabricating, omitting, or mislocalizing findings. This work proposes a fine-tuning-free, inference-time intervention that applies token-level residual modulation to late-layer features in a sparse autoencoder basis. The method combines “enhancement” directions that are largely shared across models with model-specific “suppression” directions to steer report generation. Analysis shows that enhancement directions overlap strongly across models, whereas suppression directions must be chosen per model, which informs the cross-model transfer strategy. Evaluated on MIMIC-CXR, the approach improves the clinical composite score by a relative 5.4%, 7.2%, and 17.0% for RadVLM, LLaVA-Rad, and CheXOne, respectively. It also achieves a +7.7% relative gain in GREEN score under zero-shot transfer to IU-Xray.
📝 Abstract
Medical vision-language models (VLMs) often hallucinate when generating chest X-ray reports: they fabricate findings that are not present in the image, miss important ones, or locate them incorrectly. We mitigate this without weight updates through decoding-time residual steering in a per-token sparse autoencoder (SAE) basis: we train Top-$K$ SAEs on late layers, use causal steering to identify features linked to clinical errors, and then apply a combined suppress/boost intervention at inference time. On the MIMIC-CXR test split, our inference-only method improves the quality of generated reports for three radiology VLMs (RadVLM, LLaVA-Rad, and CheXOne), with relative improvements of +5.4%, +7.2%, and +17.0% in the clinical composite metric, and statistically significant GREEN gains on all backbones. A cross-model feature alignment shows that the quality-promoting (boost) directions overlap strongly across architectures, whereas hallucination-linked (suppress) directions are model-specific. Transferable steering must therefore select suppression features per backbone rather than share a universal suppress list. The same recipe transfers zero-shot to IU-Xray (GREEN $+7.7\%$ rel.) without retraining, confirming that the identified features are properties of the model, not of the training corpus. We release causal feature sets and an interactive feature dashboard: https://cxr-sparse-feature-dashboard.netlify.app/.
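To make the suppress/boost intervention concrete, the following is a minimal NumPy sketch of per-token residual steering in a Top-$K$ SAE basis. It is an illustration under stated assumptions, not the released implementation: the function names (`topk_sae_encode`, `steer_residual`), the weight shapes, the strengths `alpha` and `beta`, and the choice to rescale only features that are already active are all hypothetical.

```python
import numpy as np

def topk_sae_encode(h, W_enc, b_enc, k):
    """Top-K SAE encoder: keep the k largest pre-activations (ReLU'd), zero the rest.

    h: (d,) residual vector for one token; W_enc: (d, m); b_enc: (m,).
    """
    pre = h @ W_enc + b_enc
    z = np.zeros_like(pre)
    idx = np.argpartition(pre, -k)[-k:]      # indices of the k largest entries
    z[idx] = np.maximum(pre[idx], 0.0)
    return z

def steer_residual(h, W_enc, b_enc, W_dec, k, suppress_ids, boost_ids,
                   alpha=1.0, beta=1.0):
    """Return a steered residual vector for one token.

    Assumption (hypothetical design): active suppress features are scaled by
    (1 - alpha) and active boost features by (1 + beta). Only the change in the
    decoded contribution is added back, so the part of h that the SAE does not
    reconstruct is left untouched. W_dec: (m, d).
    """
    z = topk_sae_encode(h, W_enc, b_enc, k)
    z_new = z.copy()
    s = np.asarray(suppress_ids, dtype=int)
    b = np.asarray(boost_ids, dtype=int)
    z_new[s] *= (1.0 - alpha)                # alpha = 1 removes the feature
    z_new[b] *= (1.0 + beta)                 # beta = 1 doubles the feature
    delta = (z_new - z) @ W_dec              # residual-space edit
    return h + delta

# Toy usage with random weights (illustrative only, not trained SAE weights).
rng = np.random.default_rng(0)
d, m, k = 8, 16, 4
W_enc = rng.normal(size=(d, m))
b_enc = np.zeros(m)
W_dec = rng.normal(size=(m, d))
h = rng.normal(size=d)
h_steered = steer_residual(h, W_enc, b_enc, W_dec, k,
                           suppress_ids=[3], boost_ids=[7],
                           alpha=1.0, beta=0.5)
```

In a real model this edit would be applied to the chosen late-layer residual stream at every decoding step, for example through a forward hook. Following the cross-model finding above, the suppress list would be chosen per backbone, while the boost list could be shared.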