🤖 AI Summary
High-quality coding of clinical doctor–patient dialogues is hindered by high manual annotation costs, poor inter-annotator consistency, and limited cross-domain generalizability. To address these challenges, we propose MOSAIC, a LangGraph-based multi-agent system featuring a novel four-agent collaborative architecture (codebook selection, dynamic codebook updating, label generation, and consistency verification) integrated with retrieval-augmented generation (RAG), dynamic few-shot prompting, and multi-agent coordination. MOSAIC enables automated, interpretable, and adaptive dialogue coding across multiple clinical communication frameworks. Evaluated on 50 real-world clinician–patient transcripts, it achieves an overall F1 score of 0.928; on the rheumatology subset, F1 reaches 0.962, with particularly strong performance in patient behavior identification. These results suggest that MOSAIC can improve the scalability, reliability, and cross-domain applicability of clinical communication analysis.
📝 Abstract
Clinical communication is central to patient outcomes, yet large-scale human annotation of patient-provider conversations remains labor-intensive, inconsistent, and difficult to scale. Existing approaches based on large language models typically rely on single-task models that lack adaptability, interpretability, and reliability, especially when applied across different communication frameworks and clinical domains. In this study, we developed a Multi-framework Structured Agentic AI system for Clinical Communication (MOSAIC), built on a LangGraph-based architecture that orchestrates four core agent roles: a Plan Agent for codebook selection and workflow planning, an Update Agent for maintaining up-to-date retrieval databases, a set of Annotation Agents that apply codebook-guided retrieval-augmented generation (RAG) with dynamic few-shot prompting, and a Verification Agent that provides consistency checks and feedback. To evaluate performance, we compared MOSAIC outputs against gold-standard annotations created by trained human coders. We developed and evaluated MOSAIC using 26 gold-standard annotated transcripts for training and 50 transcripts for testing, spanning the rheumatology and OB/GYN domains. On the test set, MOSAIC achieved an overall F1 score of 0.928. Performance was highest in the rheumatology subset (F1 = 0.962) and strongest for the Patient Behavior category (e.g., patients asking questions, expressing preferences, or showing assertiveness). Ablation studies showed that MOSAIC outperformed the baseline benchmarks.
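To make the four-agent workflow described above more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation and does not use LangGraph: the codebook contents, the `State` fields, the function names, the token-overlap retrieval, and the keyword-based labeler (a stand-in for an LLM call) are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical codebook: framework name -> {label: cue substrings}.
CODEBOOKS = {
    "patient_behavior": {
        "asking_question": ["?"],
        "expressing_preference": ["prefer", "rather"],
        "other": [],
    },
}


@dataclass
class State:
    """Shared state passed between the agents (hypothetical fields)."""
    utterance: str
    framework: str = ""
    few_shots: list = field(default_factory=list)
    label: str = ""
    verified: bool = False


def plan_agent(state: State) -> State:
    """Select a codebook. Hard-coded here; a real planner would decide."""
    state.framework = "patient_behavior"
    return state


def update_agent(store: list, utterance: str, label: str) -> None:
    """Add an annotated example to the retrieval store."""
    store.append((utterance, label))


def _overlap(a: str, b: str) -> int:
    """Count shared whitespace-separated tokens (toy similarity measure)."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def annotation_agent(state: State, store: list, k: int = 2) -> State:
    """Retrieve the k most similar examples, then assign a label."""
    ranked = sorted(
        store, key=lambda ex: _overlap(ex[0], state.utterance), reverse=True
    )
    # A real system would place these examples in the LLM prompt;
    # the keyword stub below does not use them.
    state.few_shots = ranked[:k]
    text = state.utterance.lower()
    state.label = "other"
    for label, cues in CODEBOOKS[state.framework].items():
        if any(cue in text for cue in cues):
            state.label = label
            break
    return state


def verification_agent(state: State) -> State:
    """Consistency check: the label must exist in the selected codebook."""
    state.verified = state.label in CODEBOOKS[state.framework]
    return state


def run_pipeline(utterance: str, store: list, max_rounds: int = 2) -> State:
    """Plan once, then annotate and verify until accepted or out of rounds."""
    state = plan_agent(State(utterance))
    for _ in range(max_rounds):
        state = verification_agent(annotation_agent(state, store))
        if state.verified:
            break
    return state


store = []
update_agent(store, "Can I take this with food?", "asking_question")
update_agent(store, "I would prefer the lower dose.", "expressing_preference")
result = run_pipeline("Is there a different medication I could try?", store)
print(result.label, result.verified)
```

In this sketch the verification step only checks that the label belongs to the chosen codebook. The feedback loop described in the abstract would presumably pass richer information back to the Annotation Agents, but the abstract does not specify that mechanism.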