🤖 AI Summary
This study addresses a challenge in ultrasound interpretation: achieving precise lesion localization and holistic clinical reasoning at the same time, a balance that existing methods often fail to strike. To this end, the authors propose Echo-α, the first framework to introduce agent-based multimodal reasoning into ultrasound analysis. Echo-α employs an invoke-and-reason architecture that integrates organ-specific detector outputs with global visual context. It is trained with a nine-task supervised curriculum and then refined by multi-reward sequential reinforcement learning under different reward trade-offs, yielding Echo-α-Grounding for lesion grounding and Echo-α-Diagnosis for diagnostic decision-making. On the cross-center test sets of multi-center renal and breast ultrasound benchmarks, Echo-α-Grounding achieves lesion localization F1@0.5 scores of 56.73% (renal) and 43.78% (breast), and Echo-α-Diagnosis achieves overall diagnostic accuracies of 74.90% (renal) and 49.20% (breast).
📝 Abstract
Ultrasound interpretation requires both precise lesion localization and holistic clinical reasoning, yet existing methods typically excel at only one of these capabilities: specialized detectors offer strong localization but limited reasoning, whereas multimodal large language models (MLLMs) provide flexible reasoning but weak grounding in specialized medical domains. We present Echo-α, an agentic multimodal reasoning model for ultrasound interpretation that unifies these strengths within an invoke-and-reason framework. Echo-α is trained to coordinate organ-specific detector outputs, integrate them with global visual context, and convert the resulting evidence into grounded diagnostic decisions beyond detector-only inference. This behavior is established through a nine-task supervised curriculum and then refined by sequential reinforcement learning under different reward trade-offs, yielding Echo-α-Grounding for lesion anchoring and Echo-α-Diagnosis for final diagnosis. On multi-center renal and breast ultrasound benchmarks, Echo-α outperforms competitive baselines on both grounding and diagnosis. In particular, on cross-center test sets, Echo-α-Grounding attains 56.73%/43.78% F1@0.5 and Echo-α-Diagnosis reaches 74.90%/49.20% overall accuracy on renal/breast ultrasound. These results suggest that agentic multimodal reasoning can turn specialized detectors into verifiable clinical evidence, offering a practical route toward ultrasound AI systems that are more accurate, interpretable, and transferable. The repository is at https://github.com/MiliLab/Echo-Alpha.
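For readers unfamiliar with the grounding metric, F1@0.5 is commonly computed by counting a predicted box as a true positive when its intersection-over-union (IoU) with a not-yet-matched ground-truth box is at least 0.5, then taking the harmonic mean of precision and recall. The Python sketch below illustrates that general idea only. The abstract does not describe the exact matching protocol, so the greedy one-to-one matching in prediction order, the `(x1, y1, x2, y2)` box format, and the names `iou` and `f1_at_iou` are all assumptions made for illustration, not the authors' evaluation code.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: List[Box], gts: List[Box], thr: float = 0.5) -> float:
    """F1 score with greedy one-to-one matching at an IoU threshold.

    Assumption: predictions are processed in the order given (for example,
    sorted by confidence) and each ground-truth box can be matched once.
    """
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp  # predictions with no qualifying match
    fn = len(gts) - tp    # ground-truth lesions that were missed
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there are no boxes at all.
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (20, 20, 30, 30)]
    gts = [(0, 0, 10, 10), (50, 50, 60, 60)]
    print(f1_at_iou(preds, gts))  # one hit, one false positive, one miss
```

In this toy case one of two predictions matches one of two lesions, so precision and recall are both 0.5 and the F1 is 0.5. Whether the reported scores are aggregated per image or over the whole test set is not stated in the abstract.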