DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generative search engines and deep-research LLM agents frequently exhibit overconfidence, disorganized citation practices, and insufficient evidential support. To address these reliability challenges, this paper introduces DeepTRACE, an end-to-end sociotechnical audit framework with eight measurable evaluation dimensions spanning answer text, sources, and citations. The framework decomposes answers into individual statements, scores each statement's confidence, and assembles citation and factual-support matrices, enabling systematic tracing and quantitative evaluation of how systems reason with and attribute evidence. Leveraging an automated extraction pipeline, an LLM judge validated against human raters, and matrix-based analytics, the authors conduct large-scale empirical evaluations across multiple publicly available models. Results show that mainstream generative search systems produce high-confidence but weakly supported answers; deep-research configurations reduce overconfidence and improve citation thoroughness, yet they remain one-sided on debate queries, and citation accuracy varies widely (40–80%) across systems.
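
To make the matrix construction concrete, here is a minimal Python sketch. The `Statement` fields and the `naive_judge` keyword matcher are hypothetical stand-ins for the paper's LLM-based components (statement decomposition, confidence scoring, and the human-validated LLM adjudicator); only the statement-by-source bookkeeping mirrors the described method.

```python
from dataclasses import dataclass

@dataclass
class Statement:
    text: str
    confidence: float   # model-expressed confidence in [0, 1]
    cited: list         # indices of sources the answer cites for this statement

def build_support_matrix(statements, sources, is_supported):
    """Statement x source factual-support matrix: entry [i][j] is 1 iff
    source j supports statement i, as decided by the judge function."""
    return [
        [1 if is_supported(s.text, src) else 0 for src in sources]
        for s in statements
    ]

# Toy run with a naive keyword judge (stand-in for the LLM adjudicator).
statements = [
    Statement("The sky appears blue due to Rayleigh scattering.", 0.9, [0]),
    Statement("Rayleigh described the effect in 1871.", 0.7, [1]),
]
sources = [
    "Rayleigh scattering explains why the sky appears blue.",
    "Lord Rayleigh published his analysis of scattering in 1871.",
]
naive_judge = lambda stmt, src: ("1871" in stmt and "1871" in src) or \
                                ("blue" in stmt and "blue" in src)
support = build_support_matrix(statements, sources, naive_judge)
print(support)  # [[1, 0], [0, 1]]
```

Row sums of `support` give per-statement evidence counts; an all-zero row flags a statement unsupported by any of the answer's own listed sources.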

📝 Abstract
Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40–80% across systems.
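
The abstract notes that the LLM judge was validated for agreement with human raters; a standard way to quantify such agreement is Cohen's kappa over parallel label sequences. A minimal sketch, with the binary support labels invented purely for illustration:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two parallel label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both raters emit the same label by chance.
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

llm_labels   = [1, 1, 0, 1, 0, 0, 1, 1]  # LLM-judge "supported?" decisions
human_labels = [1, 1, 0, 0, 0, 0, 1, 1]  # human-rater decisions
agreement = sum(x == y for x, y in zip(llm_labels, human_labels)) / len(llm_labels)
print(f"raw agreement: {agreement:.2f}")                               # 0.88
print(f"Cohen's kappa: {cohens_kappa(llm_labels, human_labels):.2f}")  # 0.75
```

Kappa corrects raw agreement for chance, which matters here because support labels are often imbalanced.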
Problem

Research questions and friction points this paper is trying to address.

Auditing AI systems for reliability across citations
Measuring overconfidence and weak sourcing in responses
Evaluating citation accuracy and evidence support in outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Statement-level analysis with decomposition and confidence scoring
Citation and factual-support matrices for evidence auditing (see the metric sketch after this list)
Automated extraction pipelines and LLM-judge evaluation
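
The headline reliability numbers can be read off the matrices from the second bullet above. A sketch assuming rows are statements and columns are listed sources, with a parallel citation matrix marking which sources each statement cites; this layout is an assumption for illustration, not the paper's exact formulation:

```python
def citation_accuracy(citation, support):
    """Fraction of citations that point to a source that actually
    supports the cited statement (both matrices: statements x sources)."""
    cited = [(i, j) for i, row in enumerate(citation)
                    for j, c in enumerate(row) if c]
    return sum(support[i][j] for i, j in cited) / len(cited) if cited else 0.0

def unsupported_fraction(support):
    """Fraction of statements supported by none of the listed sources."""
    return sum(1 for row in support if not any(row)) / len(support)

# Toy matrices: 3 statements x 2 sources.
citation = [[1, 0], [0, 1], [1, 1]]  # which sources each statement cites
support  = [[1, 0], [0, 0], [1, 0]]  # which sources actually support it
print(citation_accuracy(citation, support))  # 0.5
print(unsupported_fraction(support))         # 0.333...
```

Under this reading, the paper's 40–80% range would correspond to `citation_accuracy` computed per system, and the "unsupported statements" finding to `unsupported_fraction`.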
👥 Authors
Pranav Narayanan Venkit
Salesforce AI Research, Palo Alto, CA 94301, USA

Philippe Laban
Senior Research Scientist, Microsoft Research
NLP, HCI, factuality, LLM evaluation, summarization

Yilun Zhou
Massachusetts Institute of Technology
Machine Learning, Robotics

Kung-Hsiang Huang
Salesforce AI Research, Palo Alto, CA 94301, USA

Yixin Mao
Salesforce AI Research, Palo Alto, CA 94301, USA

Chien-Sheng Wu
Salesforce AI Research, Palo Alto, CA 94301, USA