LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets two limitations of existing fine-grained representation learning for chest X-rays: contrastive models lack region-level supervision, and large vision-language models capture fine-grained details poorly in external validation. Both lead to suboptimal image retrieval and phrase grounding. The authors propose LoFi, a framework that jointly optimizes a sigmoid loss, a captioning loss, and a location-aware captioning loss using a lightweight large language model. The location-aware captioning loss supplies region-level supervision through grounding and dense captioning objectives. Building on the learned representations, a fine-grained encoder is integrated into retrieval-based in-context learning to improve phrase grounding. Experiments on the MIMIC-CXR and PadChest-GR datasets report retrieval and phrase grounding performance superior to the compared methods.
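To make the retrieval step of in-context learning concrete, here is a minimal sketch in which the annotated cases most similar to a query image are selected as in-context examples. It assumes cosine similarity over embeddings; the function names, the toy embeddings, and the `case_*` labels are hypothetical and are not taken from the paper, which may select examples differently.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_examples(query_emb, bank, k=2):
    """Return the k bank entries whose embeddings are closest to the query.

    bank is a list of (embedding, example) pairs; the examples stand in for
    annotated cases that would be placed in the prompt as in-context examples.
    """
    ranked = sorted(bank, key=lambda item: cosine(query_emb, item[0]), reverse=True)
    return [example for _, example in ranked[:k]]

# Toy data (hypothetical): 3-dimensional embeddings with placeholder labels.
bank = [
    ([1.0, 0.0, 0.0], "case_a"),
    ([0.9, 0.1, 0.0], "case_b"),
    ([0.0, 1.0, 0.0], "case_c"),
]
print(retrieve_examples([1.0, 0.0, 0.0], bank, k=2))
```

In this picture, a more fine-grained encoder changes which neighbours rank highest, which is why encoder quality would matter for the downstream grounding step.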

📝 Abstract
Fine-grained representation learning is crucial for retrieval and phrase grounding in chest X-rays, where clinically relevant findings are often spatially confined. However, the lack of region-level supervision in contrastive models and the limited ability of large vision language models to capture fine-grained representations in external validation lead to suboptimal performance on these tasks. To address these limitations, we propose Location-aware Fine-grained representation learning (LoFi), which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives, thereby facilitating fine-grained representation learning. Building upon these representations, we integrate a fine-grained encoder into retrieval-based in-context learning to enhance chest X-ray grounding across diverse settings. Extensive experiments demonstrate that our method achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR.
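The joint objective can be illustrated with a small sketch. It assumes the sigmoid term is a pairwise sigmoid image-text loss in the style of SigLIP and that the three losses are combined as a weighted sum. The abstract does not give the exact formulation, so the temperature, bias, and weights below are hypothetical, and the captioning terms are passed in as precomputed scalars.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def sigmoid_pair_loss(img_embs, txt_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of L2-normalised embeddings.

    Matching pairs (i == j) get label +1, all others -1; the loss is the
    negative log-likelihood summed over all pairs, divided by the batch size.
    """
    n = len(img_embs)
    total = 0.0
    for i, img in enumerate(img_embs):
        for j, txt in enumerate(txt_embs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / n

def joint_loss(l_sigmoid, l_caption, l_location, w_caption=1.0, w_location=1.0):
    """Weighted sum of the three objectives (weights are hypothetical)."""
    return l_sigmoid + w_caption * l_caption + w_location * l_location

# Toy batch: aligned image/text embeddings versus a swapped (mismatched) batch.
imgs = [[1.0, 0.0], [0.0, 1.0]]
aligned = sigmoid_pair_loss(imgs, [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_pair_loss(imgs, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Because each image-text pair is scored independently, this kind of loss needs no softmax normalisation across the batch, unlike the softmax-based contrastive objective.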
Problem

Research questions and friction points this paper is trying to address.

fine-grained representation learning
chest X-ray
phrase grounding
region-level supervision
retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained representation learning
location-aware captioning
chest X-ray grounding
retrieval-based in-context learning
region-level supervision
Myeongkyun Kang
The University of British Columbia, Vancouver, BC V6T 1Z4, Canada
Yanting Yang
The University of British Columbia, Vancouver, BC V6T 1Z4, Canada; Vector Institute, Toronto, ON M5G 0C6, Canada
Xiaoxiao Li
Assistant Professor, UBC; Vector Institute; CIFAR AI Chair; Canada Research Chair
Deep Learning, Trustworthy AI, AI for Healthcare