🤖 AI Summary
Large language models (LLMs) suffer from error accumulation and context corruption in deep research tasks because prevailing systems follow a rigid linear workflow (plan → search → write). Method: This paper proposes an explicitly controllable deep research framework featuring two parameter-free control mechanisms: (1) a verifiable checklist module that decomposes user requirements into traceable, verifiable sub-goals, refines them with human or LLM critics, and compiles a hierarchical outline to constrain subsequent reasoning steps; and (2) an evidence audit module that structures retrieved content, iteratively updates the outline, prunes noisy context, and uses an LLM-based critic to rank high-quality evidence and bind it to drafted content. Contribution/Results: The framework improves task robustness, result verifiability, and traceability without any fine-tuning. Experiments demonstrate state-of-the-art performance on deep research benchmarks, competitive results on deep search tasks, and substantial gains in output relevance and credibility.
📝 Abstract
Large language models are evolving from single-turn responders into tool-using agents capable of sustained reasoning and decision-making for deep research. Prevailing systems adopt a linear plan → search → write → report pipeline, which suffers from error accumulation and context rot due to the lack of explicit control over both model behavior and context. We introduce RhinoInsight, a deep research framework that adds two control mechanisms to enhance robustness, traceability, and overall quality without parameter updates. First, a Verifiable Checklist module transforms user requirements into traceable and verifiable sub-goals, incorporates human or LLM critics for refinement, and compiles a hierarchical outline to anchor subsequent actions and prevent non-executable planning. Second, an Evidence Audit module structures search content, iteratively updates the outline, and prunes noisy context, while a critic ranks high-quality evidence and binds it to drafted content to ensure verifiability and reduce hallucinations. Our experiments demonstrate that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks.
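The two control mechanisms described above can be sketched as a toy control loop. This is a minimal illustrative sketch, not the paper's implementation: every name (`SubGoal`, `build_checklist`, `audit_evidence`, `bind_evidence`) and the length-based critic heuristic are assumptions standing in for the LLM-driven components the abstract describes.

```python
from dataclasses import dataclass

@dataclass
class SubGoal:
    """One checklist item derived from the user requirement."""
    description: str
    verified: bool = False

@dataclass
class Evidence:
    """A retrieved snippet with a critic-assigned quality score."""
    text: str
    source: str
    score: float = 0.0

def build_checklist(requirement: str) -> list[SubGoal]:
    """Verifiable Checklist (sketch): decompose a requirement into sub-goals.
    A real system would use an LLM plus critic refinement; here we simply
    split on ';' as a stand-in for decomposition."""
    return [SubGoal(part.strip()) for part in requirement.split(";") if part.strip()]

def audit_evidence(evidence: list[Evidence], min_score: float = 0.5) -> list[Evidence]:
    """Evidence Audit (sketch): score, rank, and prune noisy evidence.
    The placeholder critic scores by snippet length; the paper's critic is
    an LLM-based evaluator."""
    for ev in evidence:
        ev.score = min(len(ev.text) / 100.0, 1.0)  # hypothetical quality heuristic
    ranked = sorted(evidence, key=lambda e: e.score, reverse=True)
    return [e for e in ranked if e.score >= min_score]

def bind_evidence(goals: list[SubGoal], evidence: list[Evidence]) -> dict:
    """Bind surviving evidence to each sub-goal so drafted claims stay
    traceable to sources (naively: every kept item binds to every goal)."""
    kept = audit_evidence(evidence)
    bindings = {}
    for goal in goals:
        bindings[goal.description] = kept
        goal.verified = bool(kept)  # a goal counts as verifiable only with evidence
    return bindings
```

Under this sketch, a requirement like `"survey prior work; compare benchmarks"` yields two sub-goals, and only evidence passing the critic threshold is bound to them, mirroring how the framework anchors writing to audited evidence rather than raw search output.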