🤖 AI Summary
This study examines whether the claims produced by multimodal generative search systems are actually supported by the videos they cite as evidence. In a large-scale audit of Gemini 2.5 Pro, the authors analyze 11,943 claim-video pairs across medical, economic, and general domains, documenting for the first time a pattern of "precise yet unsupported" detail injection. Claim support is evaluated automatically by three independent LLM judges (87.7% inter-rater agreement), with verdicts validated against human annotations, and an exploratory logistic regression attributes failures to properties of the claims. Depending on judge strictness, 3.7%–18.7% of generated claims lack video support, most often as unverifiable specifics or overstated assertions rather than outright contradictions. The risk of an unsupported claim rises significantly when the claim diverges from the source's terminology or has low semantic similarity to the cited video's content.
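The verification pipeline summarized above hinges on agreement among multiple LLM judges. Below is a minimal sketch of how such a setup can be scored, assuming a binary supported/unsupported verdict, majority-vote resolution, and pairwise percent agreement; the paper's actual judge prompts and label scheme are not given here, so all names and labels are illustrative.

```python
# Illustrative scoring for a three-judge verification setup
# (assumed binary labels; not the paper's exact protocol).
from itertools import combinations

LABELS = ("supported", "unsupported")  # assumed verdict set

def pairwise_agreement(verdicts: list[tuple[str, str, str]]) -> float:
    """Mean pairwise percent agreement over all claim-video pairs.

    Each element of `verdicts` holds the three judges' labels for one pair.
    """
    agree = total = 0
    for trio in verdicts:
        for a, b in combinations(trio, 2):
            agree += int(a == b)
            total += 1
    return agree / total

def majority_label(trio: tuple[str, str, str]) -> str:
    """Resolve the final verdict for one pair by majority vote."""
    return max(LABELS, key=trio.count)

# Toy usage with two claim-video pairs:
votes = [
    ("supported", "supported", "unsupported"),
    ("unsupported", "unsupported", "unsupported"),
]
print(f"pairwise agreement: {pairwise_agreement(votes):.1%}")  # 66.7% here
print([majority_label(v) for v in votes])  # ['supported', 'unsupported']
```

One plausible source of the reported 3.7%–18.7% spread is varying how strict the deciding rule is, e.g. requiring unanimity among judges versus accepting a single "supported" vote, though the paper's exact thresholds are not stated here.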
📝 Abstract
Multimodal Large Language Models (MLLMs) increasingly function as generative search systems that retrieve and synthesize answers from multimedia content, including YouTube videos. Although these systems project authority by citing specific videos as evidence, the extent to which these citations genuinely substantiate the generated claims remains unexamined. We present a large-scale audit of the Gemini 2.5 Pro multimodal search system, analyzing 11,943 claim-video pairs generated across Medical, Economic, and General domains. Through automated verification using three independent LLM judges (87.7% inter-rater agreement), validated against human annotations, we find that depending on the judge's strictness, between 3.7% and 18.7% of video-grounded claims are not supported by their cited sources. The dominant failure modes are not outright contradictions but rather unverifiable specificities and overstated claims, suggesting the system injects precise but ungrounded details from parametric knowledge while citing videos as evidence. Exploratory post-hoc analysis via logistic regression reveals properties associated with these failures: claims departing from source vocabulary ($\beta = -1.6$ to $-3.1$, $p < 0.01$) and claims with low semantic similarity to the video transcript ($\beta = -2.1$ to $-11.6$, $p < 0.01$) are significantly more likely to be unsupported. These findings characterize the current trustworthiness of video-based generative search and highlight the gap between the confidence these systems project and the fidelity of their outputs.
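For concreteness, here is a minimal sketch of the kind of exploratory logistic regression the abstract describes, assuming two claim-level predictors (lexical overlap with the source vocabulary, and claim-transcript semantic similarity) and a binary unsupported outcome; the feature definitions, data, and coefficients below are synthetic placeholders, not the paper's actual setup.

```python
# Sketch of a post-hoc attribution analysis: regress the unsupported/
# supported verdict on claim-level features. Synthetic data; illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def lexical_overlap(claim: str, transcript: str) -> float:
    """Fraction of claim tokens that also occur in the transcript
    (a simple stand-in for 'departure from source vocabulary')."""
    claim_toks = set(claim.lower().split())
    return len(claim_toks & set(transcript.lower().split())) / max(len(claim_toks), 1)

print(lexical_overlap("the vaccine reduces risk by 40%",
                      "the vaccine reduces risk"))  # 0.667 (4 of 6 tokens)

# Synthetic dataset: columns = [lexical_overlap, semantic_similarity],
# outcome y = 1 if the claim is unsupported. True betas are negative,
# mirroring the sign reported in the abstract (higher overlap/similarity
# -> lower probability of being unsupported).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(2000, 2))
logits = 1.0 - 2.0 * X[:, 0] - 3.0 * X[:, 1]
y = (rng.uniform(size=2000) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = LogisticRegression().fit(X, y)
print("fitted betas:", model.coef_[0])  # approx [-2, -3] up to regularization
```

In practice the similarity feature would likely come from something like cosine similarity between embeddings of the claim and the transcript; that choice is an assumption here, as the abstract does not specify it.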