🤖 AI Summary
Generative search engines and deep-research LLM agents frequently exhibit overconfidence, disorganized citation practices, and weak evidential support. To address these reliability challenges, this paper introduces DeepTRACE, an end-to-end sociotechnical audit framework designed for reliability assessment and built around eight measurable evaluation dimensions. It applies statement-level decomposition and confidence scoring to construct citation and factual-support matrices, enabling systematic tracing and quantitative evaluation of how systems reason with and attribute evidence. Leveraging an automated extraction pipeline, an LLM-judge validated against human raters, and matrix-based analytics, the authors conduct large-scale empirical evaluations across multiple publicly available models. Results reveal that mainstream generative search systems exhibit high-confidence but weakly supported behavior; deep-research configurations improve citation thoroughness but remain one-sided on debate queries, with citation accuracy varying widely (40%–80%) across systems.
📝 Abstract
Generative search engines and deep-research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep-research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40–80% across systems.
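The citation and factual-support matrices described in the abstract can be sketched with a toy example. This is an illustrative assumption, not DeepTRACE's actual pipeline: the statements, sources, and binary support judgments are made up, and in the paper the support labels come from an LLM-judge validated against human raters rather than being hard-coded.

```python
import numpy as np

# Rows = decomposed answer statements, columns = the system's listed sources.
# support[i, j] = 1 if source j factually supports statement i
# (hard-coded here; produced by a validated LLM-judge in the paper).
support = np.array([
    [1, 0, 0],   # statement 0 supported by source 0
    [0, 1, 1],   # statement 1 supported by sources 1 and 2
    [0, 0, 0],   # statement 2 unsupported by any listed source
])

# cited[i, j] = 1 if the answer explicitly cites source j for statement i.
cited = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],   # a citation is present, but the source does not support the claim
])

# Fraction of statements with no support from any listed source.
unsupported = (support.sum(axis=1) == 0).mean()

# Citation accuracy: fraction of emitted citations pointing to a source
# that genuinely supports the cited statement.
citation_accuracy = (cited & support).sum() / cited.sum()

print(f"unsupported statements: {unsupported:.0%}")        # 33%
print(f"citation accuracy:      {citation_accuracy:.0%}")  # 67%
```

Comparing the two matrices entry-wise is what lets the audit separate "the answer is unsourced" (an all-zero row of `support`) from "the answer cites the wrong source" (a citation outside the supporting set), which correspond to the unsupported-statement and citation-accuracy findings reported above.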