🤖 AI Summary
Tool-augmented chest X-ray agents improve question-answering performance, but they also introduce subtle group-level biases that end-to-end evaluation fails to detect. This work proposes a stage-wise unfairness decomposition within the MedRAX framework that separates bias into three interpretable sources: tool exposure bias, tool transition bias, and model reasoning bias. Using stage-wise fairness decomposition, conditional utility gap analysis, tool-routing tracing, and reasoning-trajectory comparison, we systematically audit agents built on five backbone models. Our findings reveal end-to-end equalized-odds gaps as high as 20.79% and subgroup utility gaps of up to 50% when conditioning on segmentation-tool availability, which underscores the importance of process-level fairness auditing before intelligent agents are deployed clinically.
📝 Abstract
Tool-using medical agents can improve chest X-ray question answering by orchestrating specialized vision and language modules, but this added pipeline complexity also creates pathways for demographic bias that do not exist in standalone models. We present our framework, Decomposing Unfairness in Chest X-ray agents, a systematic audit of chest X-ray agents instantiated with MedRAX. To localize where disparities arise, we introduce a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias (utility gaps conditioned on tool presence), tool transition bias (subgroup differences in tool-routing patterns), and model reasoning bias (subgroup differences in synthesis behaviors). Extensive experiments on tool-use-based agentic frameworks across five driver backbones reveal that (i) demographic gaps persist in end-to-end performance, with equalized-odds gaps of up to 20.79% and a fairness-utility tradeoff as low as 28.65%, and (ii) intermediate behaviors (tool usage, transition patterns, and reasoning traces) exhibit distinct subgroup disparities that cannot be predicted from end-to-end evaluation alone (e.g., conditioned on segmentation-tool availability, the subgroup utility gap reaches 50%). Our findings underscore the need for process-level fairness auditing and debiasing to ensure the equitable deployment of clinical agentic systems. Code is available here: https://anonymous.4open.science/r/DUCK-E5FE/README.md
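To make the two headline quantities concrete, the following is a minimal sketch of how an equalized-odds gap and a tool-conditioned utility gap might be computed from per-question agent logs. It is not the authors' implementation. The record fields (`group`, `label`, `pred`, `tools`), the binary-label setting, the use of accuracy as "utility", and the reading of "tool presence" as "the tool appears in the agent's trace" are all assumptions made for illustration.

```python
from collections import defaultdict


def equalized_odds_gap(records):
    """Largest between-group spread in TPR or FPR (binary labels assumed)."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for r in records:
        c = counts[r["group"]]
        if r["label"] == 1:
            c["tp" if r["pred"] == 1 else "fn"] += 1
        else:
            c["fp" if r["pred"] == 1 else "tn"] += 1

    # Skip groups that have no positives (for TPR) or no negatives (for FPR).
    tprs = [c["tp"] / (c["tp"] + c["fn"]) for c in counts.values() if c["tp"] + c["fn"]]
    fprs = [c["fp"] / (c["fp"] + c["tn"]) for c in counts.values() if c["fp"] + c["tn"]]

    tpr_gap = max(tprs) - min(tprs) if tprs else 0.0
    fpr_gap = max(fprs) - min(fprs) if fprs else 0.0
    return max(tpr_gap, fpr_gap)


def tool_conditioned_utility_gap(records, tool):
    """Accuracy gap between the best and worst subgroup, restricted to
    records whose trace includes `tool` (hypothetical field: r["tools"])."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        if tool in r["tools"]:
            total[r["group"]] += 1
            correct[r["group"]] += int(r["pred"] == r["label"])
    accs = [correct[g] / total[g] for g in total]
    return max(accs) - min(accs) if accs else 0.0


# Toy logs, invented purely for illustration (not data from the paper).
records = [
    {"group": "A", "label": 1, "pred": 1, "tools": ["segmentation"]},
    {"group": "A", "label": 0, "pred": 0, "tools": ["segmentation"]},
    {"group": "B", "label": 1, "pred": 0, "tools": ["segmentation"]},
    {"group": "B", "label": 0, "pred": 0, "tools": ["segmentation"]},
    {"group": "B", "label": 1, "pred": 1, "tools": []},
]

eo_gap = equalized_odds_gap(records)
seg_gap = tool_conditioned_utility_gap(records, "segmentation")
print(f"equalized-odds gap: {eo_gap:.2f}")
print(f"utility gap given segmentation tool: {seg_gap:.2f}")
```

Conditioning on the tool before comparing subgroups is what lets a disparity surface even when overall accuracy looks similar across groups, which is the kind of effect the abstract says end-to-end evaluation can miss.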