🤖 AI Summary
Generative search engines and deep-research LLM agents frequently exhibit overconfidence, disorganized citation practices, and weak evidential support. To address these reliability challenges, this paper introduces DeepTRACE, an end-to-end sociotechnical audit framework designed for reliability assessment and built around eight measurable evaluation dimensions. It applies statement-level decomposition and confidence scoring to construct citation and factual-support matrices, enabling systematic tracing and quantitative evaluation of how systems reason with and attribute evidence. Leveraging an automated extraction pipeline, an LLM-judge validated against human raters, and matrix-based analytics, the authors conduct large-scale empirical evaluations across multiple publicly available models. Results reveal that mainstream generative search systems exhibit high-confidence but weakly supported behavior; deep-research configurations improve citation thoroughness but remain one-sided on debate queries, with citation accuracy varying widely (40%–80%) across systems.
📝 Abstract
Generative search engines and deep-research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep-research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40–80% across systems.
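The citation and factual-support matrices described in the abstract can be sketched with a toy example. This is an illustrative assumption, not DeepTRACE's actual pipeline: the statements, sources, and binary support judgments are made up, and in the paper the support labels come from an LLM-judge validated against human raters rather than being hard-coded.

```python
import numpy as np

# Rows = decomposed answer statements, columns = the system's listed sources.
# support[i, j] = 1 if source j factually supports statement i
# (hard-coded here; produced by a validated LLM-judge in the paper).
support = np.array([
    [1, 0, 0],   # statement 0 supported by source 0
    [0, 1, 1],   # statement 1 supported by sources 1 and 2
    [0, 0, 0],   # statement 2 unsupported by any listed source
])

# cited[i, j] = 1 if the answer explicitly cites source j for statement i.
cited = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],   # a citation is present, but the source does not support the claim
])

# Fraction of statements with no support from any listed source.
unsupported = (support.sum(axis=1) == 0).mean()

# Citation accuracy: fraction of emitted citations pointing to a source
# that genuinely supports the cited statement.
citation_accuracy = (cited & support).sum() / cited.sum()

print(f"unsupported statements: {unsupported:.0%}")        # 33%
print(f"citation accuracy:      {citation_accuracy:.0%}")  # 67%
```

Comparing the two matrices entry-wise is what lets the audit separate "the answer is unsourced" (an all-zero row of `support`) from "the answer cites the wrong source" (a citation outside the supporting set), which correspond to the unsupported-statement and citation-accuracy findings reported above.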