CXRAgent: Director-Orchestrated Multi-Stage Reasoning for Chest X-Ray Interpretation

📅 2025-10-24
🤖 AI Summary
Current chest X-ray (CXR) interpretation models generalize poorly to new tasks and struggle with complex reasoning, while large language model (LLM)-based agents lack mechanisms for assessing tool reliability, which undermines clinical trustworthiness. To address these limitations, the authors propose CXRAgent, a director-orchestrated, multi-stage agent framework for CXR interpretation. It integrates LLMs, domain-specific CXR analysis tools, an Evidence-driven Validator (EDV), contextual memory, and role-specialized expert agents for collaborative reasoning. The key contributions are a central director that forms expert teams adaptively according to the task and intermediate findings, and an EDV module that normalizes and verifies tool outputs and grounds them in visual evidence. Experiments on various CXR interpretation tasks show that CXRAgent delivers strong performance, provides visual evidence for its conclusions, and generalizes well to clinical tasks of different complexity.

📝 Abstract
Chest X-ray (CXR) plays a pivotal role in clinical diagnosis, and a variety of task-specific and foundation models have been developed for automatic CXR interpretation. However, these models often struggle to adapt to new diagnostic tasks and complex reasoning scenarios. Recently, LLM-based agent models have emerged as a promising paradigm for CXR analysis, enhancing model capability through tool coordination, multi-step reasoning, and team collaboration. However, existing agents often rely on a single diagnostic pipeline and lack mechanisms for assessing the reliability of their tools, which limits their adaptability and credibility. To this end, we propose CXRAgent, a director-orchestrated, multi-stage agent for CXR interpretation, where a central director coordinates the following stages: (1) Tool Invocation: The agent strategically orchestrates a set of CXR-analysis tools, with outputs normalized and verified by the Evidence-driven Validator (EDV), which grounds diagnostic outputs with visual evidence to support reliable downstream diagnosis; (2) Diagnostic Planning: Guided by task requirements and intermediate findings, the agent formulates a targeted diagnostic plan. It then assembles an expert team accordingly, defining member roles and coordinating their interactions to enable adaptive and collaborative reasoning; (3) Collaborative Decision-making: The agent integrates insights from the expert team with accumulated contextual memories, synthesizing them into an evidence-backed diagnostic conclusion. Experiments on various CXR interpretation tasks show that CXRAgent delivers strong performance, provides visual evidence, and generalizes well to clinical tasks of different complexity. Code and data are available at https://github.com/laojiahuo2003/CXRAgent/.
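The three stages named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Director` and `Evidence` classes, the stub tools, the role names, and the `min_confidence` threshold are all hypothetical. In particular, a numeric threshold is only a stand-in for the EDV, which the abstract describes as grounding tool outputs in visual evidence.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Evidence:
    tool: str
    finding: str
    confidence: float  # hypothetical reliability score in [0, 1]


@dataclass
class Director:
    tools: Dict[str, Callable[[str], Evidence]]
    min_confidence: float = 0.5  # hypothetical stand-in for EDV acceptance
    memory: List[str] = field(default_factory=list)

    def invoke_tools(self, image_id: str) -> List[Evidence]:
        """Stage 1: run every tool and keep only outputs that pass validation."""
        outputs = [tool(image_id) for tool in self.tools.values()]
        verified = [e for e in outputs if e.confidence >= self.min_confidence]
        self.memory.extend(f"{e.tool}: {e.finding}" for e in verified)
        return verified

    def plan_team(self, task: str, evidence: List[Evidence]) -> List[str]:
        """Stage 2: pick expert roles from the task and the verified findings."""
        roles = ["radiologist"]
        if any("effusion" in e.finding for e in evidence):
            roles.append("pulmonologist")
        if "report" in task:
            roles.append("report_writer")
        return roles

    def decide(self, task: str, roles: List[str], evidence: List[Evidence]) -> Dict:
        """Stage 3: merge expert roles, findings, and memory into one result."""
        return {
            "task": task,
            "experts": roles,
            "findings": sorted(e.finding for e in evidence),
            "memory": list(self.memory),
        }

    def run(self, image_id: str, task: str) -> Dict:
        evidence = self.invoke_tools(image_id)
        roles = self.plan_team(task, evidence)
        return self.decide(task, roles, evidence)


# Stub tools standing in for real CXR analysis models (assumed outputs).
def classifier(image_id: str) -> Evidence:
    return Evidence("classifier", "pleural effusion", 0.9)


def detector(image_id: str) -> Evidence:
    return Evidence("detector", "nodule", 0.2)


director = Director(tools={"classifier": classifier, "detector": detector})
result = director.run("cxr_001", "report generation")
```

In this sketch the low-confidence detector output is dropped in stage 1, so only the classifier finding reaches team planning and the final decision. A real system would replace the rule-based role selection with LLM-driven planning, as the abstract describes.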
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of single-pipeline CXR agents lacking reliability assessment
Enhances diagnostic adaptability through multi-stage orchestration and tool validation
Improves complex reasoning via evidence-backed team collaboration and memory integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Director-orchestrated multi-stage reasoning for CXR interpretation
Evidence-driven validator verifies tool outputs with visual evidence
Assembles expert teams for adaptive collaborative diagnostic planning
Authors
Jinhui Lou
School of Computer Science, Hangzhou Dianzi University, Hangzhou, 310018, China
Yan Yang
School of Computer Science, Hangzhou Dianzi University, Hangzhou, 310018, China
Zhou Yu
School of Computer Science, Hangzhou Dianzi University, Hangzhou, 310018, China
Zhenqi Fu
Tsinghua University
low-level vision, biomedical imaging, deep learning
Weidong Han
Tencent Inc., School of Data Science, Fudan University
Large Language Model, NLP, Multi-Modal
Qingming Huang
University of the Chinese Academy of Sciences
Multimedia Analysis and Retrieval, Image and Video Processing, Pattern Recognition, Computer Vision, Video Coding
Jun Yu
School of Intelligence Science and Engineering, Harbin Institute of Technology (Shenzhen), 518055, China