🤖 AI Summary
This study addresses the challenge of uncovering latent, unlabeled clinical image–attribute causal relationships in chest X-ray (CXR) images using vision–language models (VLMs). Conventional structural causal models (SCMs) suffer from low spatial resolution, poor editing fidelity, and coarse-grained metadata, limiting their ability to identify critical data characteristics. To overcome these limitations, we propose, for the first time, fine-tuning CLIP- and Flamingo-style VLMs for attribute inversion, integrated with a causal inference–driven disentanglement strategy and a bias diagnostic framework. Experiments demonstrate that our method outperforms existing SCMs in attribute-controllable generation fidelity, implicit association discovery, and spurious correlation identification, achieving state-of-the-art results. It successfully reveals multiple clinically meaningful yet unlabeled image–attribute combinations. Moreover, it quantifies VLMs' sensitivity to biases and their generalization constraints in fine-grained image editing.
📝 Abstract
Vision-language foundation models (VLMs) have shown impressive performance in guiding image generation through text, with emerging applications in medical imaging. In this work, we are the first to investigate the question: "Can fine-tuned foundation models help identify critical, and possibly unknown, data properties?" By evaluating our proposed method on a chest X-ray dataset, we show that, across numerous metrics, these models generate higher-resolution, more precisely edited images than methods that rely on Structural Causal Models (SCMs). For the first time, we demonstrate that fine-tuned VLMs can reveal hidden data relationships that were previously obscured by the granularity of available metadata and by model capacity limitations. Our experiments demonstrate both the potential of these models to reveal underlying dataset properties and their limitations: fine-tuned VLMs remain imperfect at accurate image editing and susceptible to biases and spurious correlations.