AI Summary
Existing autonomous scientific discovery systems predominantly rely on single-agent, linear workflows, which handle three core needs poorly: generating hypotheses from multiple perspectives, recovering iteratively from experimental failures, and accumulating knowledge across runs. This work proposes AutoResearchClaw, a multi-agent autonomous research pipeline that uses structured debate for hypothesis generation and result analysis, a self-healing executor with a Pivot/Refine decision loop that turns failed experiments into usable information, and a human-in-the-loop collaboration protocol with seven intervention modes ranging from full autonomy to step-by-step oversight. The system also incorporates verifiable result reporting and cross-run evolution of experience. On ARC-Bench, a 25-topic experiment-stage benchmark, it outperforms AI Scientist v2 by 54.7%. A human-in-the-loop ablation indicates that precise, targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight.
Abstract
Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single-agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi-agent autonomous research pipeline built on five mechanisms: structured multi-agent debate for hypothesis generation and result analysis, a self-healing executor with a Pivot/Refine decision loop that transforms failures into information, verifiable result reporting that prevents fabricated numbers and hallucinated citations, human-in-the-loop collaboration with seven intervention modes spanning full autonomy to step-by-step oversight, and cross-run evolution that converts past mistakes into future safeguards. On ARC-Bench, a 25-topic experiment-stage benchmark, AutoResearchClaw outperforms AI Scientist v2 by 54.7%. A human-in-the-loop ablation across seven intervention modes reveals that precise, targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight. We position AutoResearchClaw as a research amplifier that augments rather than replaces human scientific judgment. Code is available at https://github.com/aiming-lab/AutoResearchClaw.