Hallucination as Exploit: Evidence-Carrying Multimodal Agents

📅 2026-05-18

🤖 AI Summary
This work addresses the risk that visual hallucinations in multimodal agents trigger unauthorized privileged operations rather than merely producing incorrect answers. The authors propose Evidence-Carrying Agents (ECA), built on the principle "model proposes, evidence authorizes": model outputs are treated as untrusted proposals. ECA decomposes each tool call into action-critical predicates, obtains typed certificates from constrained DOM/OCR/AX verifiers, and uses a deterministic gate to permit only the operations those certificates support. This reframes hallucination as an auditable authorization vulnerability and exposes the remaining perception error as named residuals that can be hardened individually. In experiments, ECA achieves a 0% unsafe-action rate on a 200-task end-to-end pipeline and a 120-task browser proof-of-concept, targeted hardening reduces red-team gate bypass from 15% to 1.3%, and in an audit of 500 stratified task keys no unsupported action-critical claim reaches unsafe execution under ECA.
πŸ“ Abstract
Multimodal agents use screenshots, documents, and webpages to choose tool calls. When a false visual claim triggers a click, email, extraction, or transfer, hallucination becomes an authorization failure rather than an answer-quality error. We formalize this failure mode as hallucination-to-action conversion: an unsupported perceptual claim supplies the precondition that makes a privileged action appear permitted. We propose evidence-carrying multimodal agents (ECA), which treat free-form model text as inadmissible evidence. ECA decomposes each tool call into action-critical predicates, obtains typed certificates from constrained DOM/OCR/AX verifiers, and lets a deterministic gate grant only the privileges those certificates support. The architecture does not hide perception error; it converts opaque model belief into named verifier, schema, and implementation residuals. Verifier red-teaming over 1,900 attacks exposes this residual directly: four targeted hardening steps reduce gate bypass from 15% to 1.3%. With content-derived certificates, ECA obtains 0% unsafe-action rate on a 200-task end-to-end pipeline (Wilson 95% upper bound 2.67%) and a 120-task browser proof-of-concept (upper bound 4.3%). A direct HACR audit on 500 stratified task keys shows that unsupported action-critical claims reach unsafe execution for naive agents (100.0%) and prompt-only defense (49.6%), but not for ECA. Oracle-certificate replay on 7,488 GPT-5.4 benchmark traces serves as a gate-correctness sanity check, and neural judge baselines remain bypassable under the same threat model. The resulting principle is simple: model language may propose actions, but external evidence must authorize them.
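The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names (`Predicate`, `Certificate`, `ToolCall`), the `gate` function, the exact-match check, and the rule that a call with no declared predicates is denied are all assumptions made for illustration. It shows only the core principle that free-form model text is never consulted, and that a tool call passes when every action-critical predicate is backed by a certificate from a DOM, OCR, or AX verifier.

```python
from dataclasses import dataclass
from typing import Literal

# Verifier families named in the abstract; everything else here is hypothetical.
VerifierKind = Literal["DOM", "OCR", "AX"]


@dataclass(frozen=True)
class Predicate:
    """An action-critical claim that must hold before a tool call runs."""
    name: str       # e.g. "recipient_address"
    expected: str   # value the proposed action depends on


@dataclass(frozen=True)
class Certificate:
    """Typed evidence issued by a constrained verifier, not by the model."""
    predicate: Predicate
    verifier: VerifierKind
    observed: str   # value the verifier actually read from the content

    def supports(self, predicate: Predicate) -> bool:
        # Assumption: support means an exact match on the same predicate.
        return self.predicate == predicate and self.observed == predicate.expected


@dataclass(frozen=True)
class ToolCall:
    tool: str
    predicates: tuple[Predicate, ...]
    model_rationale: str  # free-form model text; the gate never reads it


def gate(call: ToolCall, certificates: list[Certificate]) -> bool:
    """Deterministic gate: allow only if every predicate has supporting evidence.

    Assumption: a call that declares no predicates is denied, so that an
    empty decomposition cannot pass vacuously.
    """
    return bool(call.predicates) and all(
        any(cert.supports(pred) for cert in certificates)
        for pred in call.predicates
    )


recipient = Predicate("recipient_address", "alice@example.com")
call = ToolCall(
    tool="send_email",
    predicates=(recipient,),
    model_rationale="The screenshot clearly shows Alice as the recipient.",
)

# The model's confident rationale alone authorizes nothing.
print(gate(call, []))
# A certificate whose observed value matches the predicate authorizes the call.
print(gate(call, [Certificate(recipient, "DOM", "alice@example.com")]))
```

In this sketch a hallucinated claim fails closed: if no verifier can certify the predicate, or the verifier observes a different value, the call is denied regardless of what the model wrote.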
Problem

Research questions and friction points this paper is trying to address.

hallucination
multimodal agents
authorization failure
tool call
evidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

evidence-carrying agents
hallucination-to-action conversion
multimodal verification
privileged action authorization
deterministic gating
💼 Related Jobs
Guijia Zhang
Shenzhen University
Hao Zheng
Shenzhen University
Harry Yang
HKUST
computer vision
machine learning