🤖 AI Summary
Neuroimaging research is often hindered by heterogeneous multimodal data, complex processing pipelines, and poor reproducibility. This work proposes NeuroClaw, a domain-specific multi-agent research assistant for neuroimaging. It grounds its decisions in BIDS metadata and uses a three-tier hierarchy (user-facing interaction, high-level orchestration, and low-level tool skills) to decompose and execute end-to-end workflows. The system combines Docker support, pinned Python environments, GPU configuration, and automated installers for common neuroimaging tools to improve auditability and reproducibility across the pipeline. On the NeuroBench benchmark, which scores executability, artifact validity, and reproducibility readiness, NeuroClaw-enabled runs score consistently and substantially higher than direct agent invocation across multiple multimodal LLMs.
📝 Abstract
Agentic artificial intelligence systems promise to accelerate scientific workflows, but neuroimaging poses unique challenges: heterogeneous modalities (sMRI, fMRI, dMRI, EEG), long multi-stage pipelines, and persistent reproducibility risks. To address this gap, we present NeuroClaw, a domain-specialized multi-agent research assistant for executable and reproducible neuroimaging research. NeuroClaw operates directly on raw neuroimaging data across formats and modalities, grounding decisions in dataset semantics and BIDS metadata so users need not prepare curated inputs or bespoke model code. The platform combines harness engineering with end-to-end environment management, including pinned Python environments, Docker support, automated installers for common neuroimaging tools, and GPU configuration. In practice, this layer emphasizes checkpointing, post-execution verification, structured audit traces, and controlled runtime setup, making toolchains more transparent while improving reproducibility and auditability. A three-tier skill/agent hierarchy separates user-facing interaction, high-level orchestration, and low-level tool skills to decompose complex workflows into safe, reusable units. Alongside the NeuroClaw framework, we introduce NeuroBench, a system-level benchmark for executability, artifact validity, and reproducibility readiness. Across multiple multimodal LLMs, NeuroClaw-enabled runs yield consistent and substantial score improvements compared with direct agent invocation. Project homepage: https://cuhk-aim-group.github.io/NeuroClaw/index.html
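To make the three-tier hierarchy and the checkpointing, post-execution verification, and audit-trace ideas more concrete, here is a minimal Python sketch. It is not NeuroClaw's implementation: every name in it (`ToolSkill`, `Orchestrator`, `handle_request`, `write_mask`) is hypothetical, the "tool" only writes a placeholder file, and the plan is hard-coded where a real system would presumably derive it from the request and the dataset's BIDS metadata.

```python
import hashlib
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class ToolSkill:
    """Lowest tier (hypothetical): one tool call plus a check on its output."""
    name: str
    run: Callable[[Path], Path]      # takes a work dir, returns the artifact path
    verify: Callable[[Path], bool]   # post-execution verification of the artifact


@dataclass
class Orchestrator:
    """Middle tier (hypothetical): runs skills in order, checkpoints each one,
    and appends a structured record per step to an audit trace."""
    workdir: Path
    trace: List[Dict[str, str]] = field(default_factory=list)

    def execute(self, skills: List[ToolSkill]) -> List[Dict[str, str]]:
        for skill in skills:
            marker = self.workdir / f"{skill.name}.done.json"
            if marker.exists():
                # Checkpoint hit: reuse the stored record instead of re-running.
                record = json.loads(marker.read_text())
                record["status"] = "skipped"
            else:
                artifact = skill.run(self.workdir)
                if not skill.verify(artifact):
                    self.trace.append({"skill": skill.name, "status": "failed"})
                    raise RuntimeError(f"verification failed for {skill.name}")
                # Hash the artifact so later runs can be compared byte for byte.
                digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
                record = {
                    "skill": skill.name,
                    "artifact": str(artifact),
                    "sha256": digest,
                    "status": "ok",
                }
                marker.write_text(json.dumps(record))
            self.trace.append(record)
        return self.trace


def write_mask(workdir: Path) -> Path:
    """Stand-in for a real neuroimaging tool; it only writes a placeholder file."""
    out = workdir / "brain_mask.txt"
    out.write_text("mask")
    return out


def handle_request(request: str, workdir: Path) -> List[Dict[str, str]]:
    """Top tier (hypothetical): turns a user request into a plan of skills.
    The plan is fixed here and the request text is ignored."""
    plan = [
        ToolSkill(
            name="brain_mask",
            run=write_mask,
            verify=lambda p: p.exists() and p.stat().st_size > 0,
        )
    ]
    return Orchestrator(workdir).execute(plan)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = handle_request("skull-strip sub-01", Path(tmp))
        second = handle_request("skull-strip sub-01", Path(tmp))
        print(first[0]["status"], second[0]["status"])  # ok skipped
```

Running the same request twice against one working directory executes the skill once and then reuses the checkpoint. Both audit records carry the same SHA-256 digest, which is one simple way to check that a rerun refers to the same artifact.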