Hallucination as Exploit: Evidence-Carrying Multimodal Agents

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the risk that visual hallucinations in multimodal agents trigger unauthorized privileged operations rather than merely producing incorrect answers. The authors propose evidence-carrying agents (ECA), built on the principle “model proposes, evidence authorizes”: model outputs are treated as untrusted proposals. ECA decomposes each tool call into action-critical predicates and uses constrained DOM/OCR/AX verifiers to issue typed certificates; a deterministic gate then permits only the operations those certificates support. This reframes hallucination as an auditable authorization vulnerability and exposes the remaining perception error as named residuals that can be hardened. In the reported experiments, ECA reaches a 0% unsafe-action rate on a 200-task end-to-end pipeline and a 120-task browser proof-of-concept, four targeted hardening steps reduce red-team gate bypass from 15% to 1.3%, and in a direct audit on 500 stratified task keys no unsupported action-critical claim reaches unsafe execution under ECA.

📝 Abstract

Multimodal agents use screenshots, documents, and webpages to choose tool calls. When a false visual claim triggers a click, email, extraction, or transfer, hallucination becomes an authorization failure rather than an answer-quality error. We formalize this failure mode as hallucination-to-action conversion: an unsupported perceptual claim supplies the precondition that makes a privileged action appear permitted. We propose evidence-carrying multimodal agents (ECA), which treat free-form model text as inadmissible evidence. ECA decomposes each tool call into action-critical predicates, obtains typed certificates from constrained DOM/OCR/AX verifiers, and lets a deterministic gate grant only the privileges those certificates support. The architecture does not hide perception error; it converts opaque model belief into named verifier, schema, and implementation residuals. Verifier red-teaming over 1,900 attacks exposes this residual directly: four targeted hardening steps reduce gate bypass from 15% to 1.3%. With content-derived certificates, ECA obtains 0% unsafe-action rate on a 200-task end-to-end pipeline (Wilson 95% upper bound 2.67%) and a 120-task browser proof-of-concept (upper bound 4.3%). A direct HACR audit on 500 stratified task keys shows that unsupported action-critical claims reach unsafe execution for naive agents (100.0%) and prompt-only defense (49.6%), but not for ECA. Oracle-certificate replay on 7,488 GPT-5.4 benchmark traces serves as a gate-correctness sanity check, and neural judge baselines remain bypassable under the same threat model. The resulting principle is simple: model language may propose actions, but external evidence must authorize them.
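The gating idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names (`Predicate`, `Certificate`, `ToolCall`), the `dom_verifier` helper, the exact-match rule, and the deny-by-default policy for calls with no predicates are all assumptions made for illustration. The only property it tries to capture is that the model's free-form rationale is never consulted, and that a call is allowed only when every action-critical predicate has a matching certificate from a trusted verifier.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical verifier labels, mirroring the DOM/OCR/AX verifiers named in the abstract.
TRUSTED_VERIFIERS = frozenset({"dom", "ocr", "ax"})


@dataclass(frozen=True)
class Predicate:
    name: str       # action-critical field, e.g. "recipient"
    expected: str   # value the model claims to have perceived


@dataclass(frozen=True)
class Certificate:
    predicate: Predicate
    verifier: str   # which verifier produced the observation
    observed: str   # value read from page content, not from model text


@dataclass(frozen=True)
class ToolCall:
    tool: str
    predicates: Tuple[Predicate, ...]
    rationale: str  # free-form model text; the gate never reads it


def dom_verifier(pred: Predicate, dom_fields: Dict[str, str]) -> Optional[Certificate]:
    """Toy stand-in for a constrained DOM verifier: look the field up in page content."""
    observed = dom_fields.get(pred.name)
    if observed is None:
        return None  # no evidence available, so no certificate is issued
    return Certificate(pred, "dom", observed)


def collect_certificates(call: ToolCall, dom_fields: Dict[str, str]) -> List[Certificate]:
    """Run the verifier over every predicate and keep the certificates it issues."""
    certs = []
    for pred in call.predicates:
        cert = dom_verifier(pred, dom_fields)
        if cert is not None:
            certs.append(cert)
    return certs


def gate(call: ToolCall, certs: List[Certificate]) -> bool:
    """Deterministic gate: allow only if every predicate is backed by a matching certificate."""
    if not call.predicates:
        return False  # assumption: a privileged call with no stated predicates is denied
    for pred in call.predicates:
        supported = any(
            c.predicate == pred
            and c.verifier in TRUSTED_VERIFIERS
            and c.observed == pred.expected
            for c in certs
        )
        if not supported:
            return False
    return True


call = ToolCall(
    tool="send_transfer",
    predicates=(Predicate("recipient", "alice@example.com"), Predicate("amount", "100.00")),
    rationale="The page clearly shows Alice as the payee.",
)

# Page content supports both claims, so the gate allows the call.
supporting_page = {"recipient": "alice@example.com", "amount": "100.00"}
allowed = gate(call, collect_certificates(call, supporting_page))    # True

# Page content contradicts the recipient and lacks the amount: the claim is unsupported.
contradicting_page = {"recipient": "mallory@example.com"}
blocked = gate(call, collect_certificates(call, contradicting_page))  # False
```

In this sketch a confident rationale has no effect on the outcome: the second call is denied even though the model's text asserts the payee is correct, because the decision depends only on what the verifier observed.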

Problem

Research questions and friction points this paper is trying to address.

hallucination

multimodal agents

authorization failure

tool call

evidence

Innovation

Methods, ideas, or system contributions that make the work stand out.

evidence-carrying agents

hallucination-to-action conversion

multimodal verification