🤖 AI Summary
Existing autonomous research systems often produce unverifiable outputs: fabricated citations, irreproducible scores, and method descriptions that diverge from the implemented code. To address this, the work proposes the Chain-of-Evidence (CoE) framework, which requires every claim to be traceable to its evidence source, and implements ScientistOne, an end-to-end autonomous research system that maintains these evidence chains by construction throughout literature review, solution discovery, and paper writing. It also introduces CoE Audit, a post-hoc audit with four integrity checks (score verification, specification violation, reference verification, and method-code alignment) that applies uniformly to all systems. Across five frontier research tasks, ScientistOne matches or exceeds human expert performance while achieving zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15). It further generalizes to six additional tasks, achieving state-of-the-art results on Parameter Golf and gold medals on MLE-Bench tasks.
📝 Abstract
Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
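The core idea of binding each claim to an evidence source, and of reporting each audit check as a pass rate over papers, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Claim` type, the `untraceable` and `pass_rate` helpers, and the example evidence path are hypothetical names introduced here for illustration. Only the four check names and the 14/15 alignment count come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional

# The four check names follow the abstract; everything else is hypothetical.
CHECKS = (
    "score_verification",
    "specification_violation",
    "reference_verification",
    "method_code_alignment",
)


@dataclass(frozen=True)
class Claim:
    """A statement in a paper plus an optional pointer to its evidence."""
    text: str
    evidence: Optional[str] = None  # hypothetical pointer, e.g. a log path or DOI


def untraceable(claims: List[Claim]) -> List[Claim]:
    """Return claims with no evidence attached (None or empty string)."""
    return [c for c in claims if not c.evidence]


def pass_rate(results: List[bool]) -> float:
    """Fraction of papers that pass one check; one boolean per paper."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(results) / len(results)


# Illustrative usage: one traceable claim, one claim with no evidence.
claims = [
    Claim("The model reaches the reported score", evidence="runs/eval.log"),
    Claim("Prior work uses a similar loss"),
]
missing = untraceable(claims)

# Method-code alignment of 14 out of 15 papers, as reported in the abstract.
alignment = pass_rate([True] * 14 + [False])
```

Under these assumptions, a claim that reaches the manuscript without an evidence pointer shows up in `missing`, and each audit dimension reduces to a single comparable rate per system.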