🤖 AI Summary
Current large language models lack the native spatial reasoning capabilities required to directly analyze three-dimensional medical images such as brain MRI, limiting their utility in complex neuroradiological tasks. This work proposes a training-free agent framework that enables large language models to autonomously perform end-to-end brain MRI analysis by orchestrating external specialized tools, including skull stripping, image registration, and tumor segmentation modules. The resulting workflows span preprocessing, lesion segmentation, volumetric assessment, and longitudinal treatment response evaluation across multiple timepoints. We demonstrate for the first time the feasibility of training-free agents in high-complexity neuroimaging analysis, comparing single-agent and multi-expert collaborative mechanisms, and release the first BraTS-based image-prompt-answer evaluation benchmark. Experiments show that the approach autonomously executes these tasks on models such as GPT-5.1, Gemini 3 Pro, and Claude Sonnet 4.5 without any fine-tuning.
📝 Abstract
State-of-the-art large language models (LLMs) show high performance in general visual question answering. However, a fundamental limitation remains: current architectures lack the native 3D spatial reasoning required for direct analysis of volumetric medical imaging, such as CT or MRI. Emerging agentic AI offers a new solution, eliminating the need for intrinsic 3D processing by enabling LLMs to orchestrate and leverage specialized external tools. Yet, the feasibility of such agentic frameworks in complex, multi-step radiological workflows remains underexplored. In this work, we present a training-free agentic pipeline for automated brain MRI analysis. Validating our methodology on several LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off-the-shelf domain-specific tools, our system autonomously executes complex end-to-end workflows, including preprocessing (skull stripping, registration), pathology segmentation (glioma, meningioma, metastases), and volumetric analysis. We evaluate our framework across increasingly complex radiological tasks, from single-scan segmentation and volumetric reporting to longitudinal response assessment requiring multi-timepoint comparisons. We analyze the impact of architectural design by comparing single-agent models against multi-agent "domain-expert" collaborations. Finally, to support rigorous evaluation of future agentic systems, we introduce and release a benchmark dataset of image-prompt-answer tuples derived from public BraTS data. Our results demonstrate that agentic AI can solve highly complex neuroradiological image analysis tasks through tool use without the need for training or fine-tuning.
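To make the tool-use idea concrete, the following is a minimal sketch of how an agent might delegate volumetric and longitudinal calculations to external functions rather than reasoning over voxels itself. It is not the authors' implementation: the tool names (`lesion_volume_ml`, `percent_change`), the registry `TOOLS`, the dispatcher `call_tool`, and the nested-list mask format are all hypothetical, and a real pipeline would presumably obtain masks from a segmentation tool and read voxel spacing from the image header.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical 3D binary mask as nested lists indexed [z][y][x]; 1 marks a lesion voxel.
Mask = List[List[List[int]]]


def lesion_volume_ml(mask: Mask, spacing_mm: Sequence[float]) -> float:
    """Lesion volume = number of lesion voxels * volume of one voxel."""
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    n_voxels = sum(value for plane in mask for row in plane for value in row)
    return n_voxels * voxel_mm3 / 1000.0  # 1 mL = 1000 mm^3


def percent_change(baseline_ml: float, followup_ml: float) -> float:
    """Relative volume change from baseline to follow-up, in percent."""
    if baseline_ml <= 0:
        raise ValueError("baseline volume must be positive")
    return 100.0 * (followup_ml - baseline_ml) / baseline_ml


# Hypothetical tool registry: the agent selects a tool by name and supplies arguments.
TOOLS: Dict[str, Callable[..., float]] = {
    "lesion_volume_ml": lesion_volume_ml,
    "percent_change": percent_change,
}


def call_tool(name: str, **kwargs) -> float:
    """Dispatch one tool call, as an LLM-issued function call might be routed."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)


def compare_timepoints(baseline: Mask, followup: Mask,
                       spacing_mm: Sequence[float]) -> Dict[str, float]:
    """Chain tool calls for a two-timepoint volumetric comparison."""
    v0 = call_tool("lesion_volume_ml", mask=baseline, spacing_mm=spacing_mm)
    v1 = call_tool("lesion_volume_ml", mask=followup, spacing_mm=spacing_mm)
    change = call_tool("percent_change", baseline_ml=v0, followup_ml=v1)
    return {"baseline_ml": v0, "followup_ml": v1, "change_pct": change}
```

In this sketch the language model's role would be limited to choosing which tool to call and interpreting the returned numbers. The sketch deliberately stops at volume change and does not encode any response-assessment criteria, since the abstract does not specify which criteria the framework applies.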