🤖 AI Summary
This study addresses the challenge of detecting latent biases in large language models (LLMs) deployed in healthcare—a critical yet underexplored fairness concern. We propose the first multi-hop adversarial probing framework that integrates medical knowledge graphs (KGs) with auxiliary LLMs. Our method constructs a structured, semantically grounded KG to encode domain-specific relationships, applies targeted adversarial perturbations to expose hidden biases, and leverages multi-hop reasoning jointly with an auxiliary LLM to identify subtle, cross-entity and cross-attribute bias patterns. Evaluated across three medical benchmarks, six state-of-the-art LLMs, and five bias categories (e.g., gender, race, geography), our approach significantly outperforms existing baselines. It achieves substantial improvements in bias detection rate, interpretability, and cross-model generalizability—establishing a scalable, automated paradigm for fairness assessment in medical AI.
📝 Abstract
Large language models (LLMs) used in medical applications are known to exhibit biased and unfair behavior. Before these models are adopted in clinical decision-making, it is crucial to identify such bias patterns so that their impact can be effectively mitigated. In this study, we present a novel framework that combines knowledge graphs (KGs) with auxiliary LLMs to systematically reveal complex bias patterns in medical LLMs. Specifically, the proposed approach integrates adversarial perturbation techniques to surface subtle bias patterns, and it adopts a customized multi-hop characterization of KGs to support the systematic evaluation of arbitrary LLMs. Through comprehensive experiments on three datasets, six LLMs, and five bias types, we show that the proposed framework reveals complex bias patterns with noticeably greater effectiveness and scalability than existing baselines.
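The probing pipeline the abstract describes (encode domain knowledge as KG triples, enumerate multi-hop paths, then apply a targeted demographic perturbation to produce counterfactual probe pairs) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the toy triples, the question template, and the `("male", "female")` attribute pair are all assumptions for demonstration.

```python
# Hypothetical sketch of multi-hop adversarial probing over a toy medical KG.
# The triples, template, and attribute pair below are illustrative only.

# Knowledge graph as (head, relation, tail) triples.
KG = [
    ("hypertension", "treated_with", "ACE inhibitor"),
    ("ACE inhibitor", "contraindicated_in", "pregnancy"),
    ("hypertension", "risk_factor_for", "stroke"),
]

def multi_hop_paths(kg, start, hops=2):
    """Enumerate all relation paths of exactly `hops` edges from `start`."""
    paths = [[(h, r, t)] for (h, r, t) in kg if h == start]
    for _ in range(hops - 1):
        paths = [p + [(h, r, t)]
                 for p in paths
                 for (h, r, t) in kg
                 if h == p[-1][2]]  # extend from the tail of the last edge
    return paths

def counterfactual_probes(path, attribute_pair=("male", "female")):
    """Render one KG path into two questions that differ only in a
    demographic attribute -- the adversarial perturbation. Divergent
    model answers on such a pair would flag a potential bias."""
    chain = "; ".join(f"{h} {r.replace('_', ' ')} {t}" for (h, r, t) in path)
    template = "Given that {chain}, what do you recommend for a {attr} patient?"
    return tuple(template.format(chain=chain, attr=a) for a in attribute_pair)

paths = multi_hop_paths(KG, "hypertension", hops=2)
probe_a, probe_b = counterfactual_probes(paths[0])
```

In the full framework, each probe pair would be sent to the LLM under test, and an auxiliary LLM would judge whether the paired responses diverge in a bias-relevant way; multi-hop paths matter because they expose inconsistencies that single-fact prompts miss.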