🤖 AI Summary
Molecular Tumor Boards (MTBs) rely on manually curated patient summaries, which suffer from low efficiency, high subjectivity, and frequent omission of critical information; conventional automated summarization metrics are hindered by lexical variation and fail to objectively assess comprehensiveness and conciseness. To address these challenges, we propose HAO, a large language model (LLM)-based multi-agent collaborative framework that automatically synthesizes structured patient histories from heterogeneous clinical data. We further introduce TBFact, an evaluation paradigm in which an LLM acts as an impartial “judge,” enabling local, data-free, and privacy-preserving quality assessment without sharing sensitive patient data. Evaluated on real-world MTB consultation data, HAO captures 94% of high-importance clinical information, and TBFact attains a recall of 0.84 under strict entailment criteria, demonstrating substantial improvements in summary completeness, consistency, and objectivity of evaluation.
📝 Abstract
Molecular Tumor Boards (MTBs) are multidisciplinary forums where oncology specialists collaboratively assess complex patient cases to determine optimal treatment strategies. A central element of this process is the patient summary, typically compiled by a medical oncologist, radiation oncologist, or surgeon, or their trained medical assistant, who distills heterogeneous medical records into a concise narrative to facilitate discussion. This manual approach is often labor-intensive, subjective, and prone to omissions of critical information. To address these limitations, we introduce the Healthcare Agent Orchestrator (HAO), a Large Language Model (LLM)-driven AI agent that coordinates a multi-agent clinical workflow to generate accurate and comprehensive patient summaries for MTBs. Evaluating predicted patient summaries against ground truth presents additional challenges due to stylistic variation, ordering, synonym usage, and phrasing differences, which complicate the measurement of both succinctness and completeness. To overcome these evaluation hurdles, we propose TBFact, a “model-as-a-judge” framework designed to assess the comprehensiveness and succinctness of generated summaries. Using a benchmark dataset derived from de-identified tumor board discussions, we applied TBFact to evaluate our Patient History agent. Results show that the agent captured 94% of high-importance information (including partial entailments) and achieved a TBFact recall of 0.84 under strict entailment criteria. We further demonstrate that TBFact enables a data-free evaluation framework that institutions can deploy locally without sharing sensitive clinical data. Together, HAO and TBFact establish a robust foundation for delivering reliable and scalable support to MTBs.
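The abstract reports two flavors of score: a strict TBFact recall (full entailment only) and a lenient coverage figure that also credits partial entailments. A minimal sketch of how such fact-level scores could be aggregated is shown below; the label names (`entailed`, `partial`, `missing`) and the function are illustrative assumptions, not the authors' implementation, which would obtain these labels from an LLM judge comparing generated summaries against reference facts.

```python
# Hedged sketch of TBFact-style aggregation, assuming an LLM judge has
# already labeled each reference fact as "entailed", "partial", or
# "missing" with respect to the generated summary. Labels and function
# names are hypothetical; this is not the paper's actual code.
from collections import Counter

def tbfact_scores(labels):
    """Return (strict_recall, lenient_coverage) over reference facts.

    strict_recall  counts only fully entailed facts;
    lenient_coverage also credits partial entailments.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    strict = counts["entailed"] / total
    lenient = (counts["entailed"] + counts["partial"]) / total
    return strict, lenient

# Toy example: 10 hypothetical judge labels for one patient summary.
labels = ["entailed"] * 8 + ["partial", "missing"]
strict, lenient = tbfact_scores(labels)
print(strict, lenient)  # 0.8 0.9
```

Under this reading, the paper's 0.84 strict recall and 94% high-importance coverage correspond to the two returned quantities, computed over the benchmark's reference facts.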