🤖 AI Summary
This study addresses the inefficiency and poor auditability of clinical multimodal hypothesis testing caused by fragmented workflows. The authors propose VERITAS, a multi-agent system that autonomously tests natural-language hypotheses on multimodal clinical datasets and produces an auditable evidence chain running from analysis plan to segmentation masks to statistical code to final verdict. The framework introduces epistemic evidence labels that mechanically classify each outcome into one of four categories (Supported, Refuted, Underpowered, or Invalid) by jointly evaluating significance, effect direction, and study power. Role-specialized agents handle the workflow in four phases and generate verifiable analysis plans, segmentation masks, statistical code, and conclusions. Evaluated on a medical imaging benchmark of 64 hypotheses across cardiac and brain glioma MRI, the method achieves 81.4% verdict accuracy with frontier models and 71.2% with locally hosted open-weight models (8-30B), along with the highest rate of independently verifiable statistical outputs (86.6%), outperforming all five single-model baselines.
📝 Abstract
Drawing meaningful conclusions from inherently multimodal clinical data (including medical imaging) requires coordinating expertise across the clinical specialty, radiology, programming, and biostatistics. This fragmented process bottlenecks discovery. We present VERITAS (Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems), a multi-agent system that autonomously tests natural-language hypotheses on multimodal clinical datasets while producing a fully auditable evidence trail: every statistical conclusion traces through inspectable, executable outputs from analysis plan to segmentation masks to statistical code to final verdict. VERITAS decomposes the workflow into four phases handled by role-specialized agents, and introduces an epistemic evidence label framework that mechanically classifies outcomes as Supported, Refuted, Underpowered, or Invalid by jointly evaluating significance, effect direction, and study power. This distinction is critical in medical imaging, where non-significant results often reflect insufficient sample size rather than absent effects. To evaluate the system, we construct a tiered benchmark of 64 hypotheses spanning six complexity levels across cardiac (ACDC, 150 subjects) and brain glioma (UCSF-PDGM, 501 subjects) MRI. VERITAS reaches 81.4% verdict accuracy with frontier models and 71.2% with locally hosted open-weight models (8-30B), outperforming all five single-model baselines in both classes. It also produces the highest rate of independently verifiable statistical outputs (86.6%), so even its failures remain diagnosable through artifact inspection. Structured multi-agent decomposition thus substitutes for model scale while preserving the verifiability clinical research demands.
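
The epistemic evidence labels described above lend themselves to a small decision rule. The following is a minimal sketch of how such a classification might look, not the VERITAS implementation: the abstract does not give the actual rule, so the `AnalysisOutcome` type, the `classify` function, the thresholds `alpha=0.05` and `min_power=0.8`, the treatment of missing or out-of-range inputs as Invalid, and the choice to label an adequately powered non-significant result as Refuted are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    SUPPORTED = "Supported"
    REFUTED = "Refuted"
    UNDERPOWERED = "Underpowered"
    INVALID = "Invalid"


@dataclass
class AnalysisOutcome:
    """Hypothetical container for the quantities the labels depend on."""
    p_value: Optional[float]   # significance of the statistical test
    effect: Optional[float]    # signed effect estimate
    expected_sign: int         # +1 or -1: direction the hypothesis predicts
    power: Optional[float]     # estimated study power, in [0, 1]


def classify(outcome: AnalysisOutcome,
             alpha: float = 0.05,       # assumed significance threshold
             min_power: float = 0.8     # assumed adequate-power threshold
             ) -> Verdict:
    """Map significance, effect direction, and power to one of four labels."""
    values = (outcome.p_value, outcome.effect, outcome.power)
    # Assumption: missing, NaN, or out-of-range inputs mean the analysis
    # did not produce a usable result, so the outcome is Invalid.
    if any(v is None or math.isnan(v) for v in values):
        return Verdict.INVALID
    if not (0.0 <= outcome.p_value <= 1.0 and 0.0 <= outcome.power <= 1.0):
        return Verdict.INVALID
    if outcome.expected_sign not in (1, -1):
        return Verdict.INVALID

    significant = outcome.p_value < alpha
    direction_matches = outcome.effect * outcome.expected_sign > 0

    if significant:
        # A significant effect supports the hypothesis only if it points
        # in the predicted direction.
        return Verdict.SUPPORTED if direction_matches else Verdict.REFUTED
    # Non-significant: separate "no effect found despite adequate power"
    # from "sample too small to tell".
    return Verdict.REFUTED if outcome.power >= min_power else Verdict.UNDERPOWERED
```

Under these assumptions, a non-significant result with low estimated power, such as `classify(AnalysisOutcome(p_value=0.30, effect=0.1, expected_sign=1, power=0.35))`, is labeled Underpowered rather than Refuted, which is the distinction the abstract highlights for small medical imaging cohorts.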