Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two key bottlenecks in scientific reasoning (explicit retrieval imposing a hidden "tool tax" of redundant tokens and steps, and multi-agent averaging diluting high-quality solutions), this paper proposes an integrated framework combining implicit retrieval with structured collaboration. Methodologically, it introduces (1) Monitor-guided implicit RAG for token-level knowledge injection, eliminating the interference caused by explicit retrieval, and (2) Hierarchical Solution Refinement (HSR) coupled with Quality-Aware Iterative Reasoning (QAIR), enabling adaptive multi-agent co-optimization while preserving high-fidelity solutions. Evaluated on HLE Bio/Chem Gold, the approach achieves 48.3% accuracy, surpassing the strongest baseline by 13.4 percentage points, while reducing token consumption by 53.5% and agent steps by 43.7%. Cross-domain robustness is further validated on SuperGPQA and TRQA.

📝 Abstract
Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden "tool tax" of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy -- the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation. Code is available at: https://github.com/tangxiangru/Eigen-1.
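The abstract describes the Monitor-based module as injecting external knowledge at the token level, so the model never breaks out of its reasoning into an explicit tool-call turn. As a rough illustrative sketch only (not the authors' implementation; see their repository for that), the control flow might resemble the loop below, where `step`, `retrieve`, and the confidence threshold are all hypothetical stand-ins:

```python
def generate_with_monitor(step, retrieve, prompt, threshold=0.7, max_tokens=50):
    """Sketch of monitor-guided implicit retrieval (hypothetical interface).

    `step(context)` returns the next token and a confidence in [0, 1];
    `retrieve(context)` returns a knowledge snippet. When confidence drops
    below `threshold`, the snippet is spliced into the context in place,
    so reasoning continues without a separate tool-call step.
    """
    context, output = prompt, []
    for _ in range(max_tokens):
        token, confidence = step(context)
        if token is None:  # end of generation
            break
        if confidence < threshold:
            # Inject retrieved knowledge directly into the context,
            # then let the model re-emit the uncertain token.
            context = context + " [KB: " + retrieve(context) + "]"
            token, confidence = step(context)
        if token is None:
            break
        context += " " + token
        output.append(token)
    return " ".join(output)
```

The point of the sketch is the contrast with explicit RAG: retrieval happens inside the decoding loop, gated by a monitor signal, rather than as a separate agent step that costs extra tokens.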
Problem

Research questions and friction points this paper is trying to address.

Reducing token and step overhead from explicit retrieval in scientific reasoning
Preventing solution dilution through uniform multi-agent candidate averaging
Integrating external knowledge with minimal disruption to reasoning flow
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monitor-based token-level retrieval for implicit knowledge integration
Hierarchical Solution Refinement with anchor repair by peers
Quality-Aware Iterative Reasoning adapting to solution quality
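To make the interplay of these two ideas concrete, here is a minimal sketch, assuming a `repair(anchor, peers)` operator and a `score` function that are placeholders for the paper's LLM-based agents; it only illustrates the loop structure (each candidate takes a turn as the anchor, and refinement repeats while quality is low), not the authors' actual implementation:

```python
def hsr_round(candidates, repair):
    """One Hierarchical Solution Refinement round: each candidate in turn
    serves as the anchor and is repaired using its peers as references."""
    refined = []
    for i, anchor in enumerate(candidates):
        peers = candidates[:i] + candidates[i + 1:]
        refined.append(repair(anchor, peers))
    return refined

def qair(candidates, repair, score, threshold, max_iters=5):
    """Quality-Aware Iterative Reasoning: keep running HSR rounds until the
    best candidate's quality clears the threshold or iterations run out,
    so strong solutions are preserved rather than averaged away."""
    for _ in range(max_iters):
        if score(max(candidates, key=score)) >= threshold:
            break
        candidates = hsr_round(candidates, repair)
    return max(candidates, key=score)
```

Note the contrast with uniform aggregation: the final answer is the best surviving candidate after peer repair, not an average over all candidates.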