Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow

📅 2025-09-25
🤖 AI Summary
This paper identifies a failure mode in multi-agent systems built on vision-language models, "visual hallucination snowballing": a visual hallucination seeded in one agent is progressively amplified by subsequent agents because visual information is relayed mainly through text, and the attention allocated to vision tokens declines over agent turns. To address this, the authors propose ViF, a lightweight, plug-and-play mitigation paradigm. Turn-, layer-, and token-level attention analyses identify a subset of vision tokens with a unimodal attention peak in the middle layers that best preserve visual evidence. ViF relays inter-agent messages through these visual relay tokens (Visual Flow) and applies attention reallocation to amplify the pattern. Evaluated on four common multi-agent structures, ten base models, and eight benchmarks, ViF markedly reduces hallucination snowballing and consistently improves performance.

📝 Abstract
Multi-Agent Systems (MAS) powered by Visual Language Models (VLMs) enable challenging tasks but suffer from a novel failure mode, multi-agent visual hallucination snowballing, in which hallucinations are seeded in a single agent and amplified by the following ones because of over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing, namely a reduction in visual attention allocation. These analyses lead us to identify a subset of vision tokens with a unimodal attention peak in the middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in visual hallucination snowballing in MAS. We therefore propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow, powered by the selected visual relay tokens, and applies attention reallocation to amplify this pattern. Experimental results demonstrate that our method markedly reduces hallucination snowballing and consistently improves performance across eight benchmarks, four common MAS structures, and ten base models. The source code will be available at: https://github.com/YU-deep/ViF.git.
Problem

Research questions and friction points this paper is trying to address.

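VLM-based multi-agent systems suffer from visual hallucination snowballing
Over-reliance on textual flow to relay visual information reduces visual attention allocation
Vision tokens that best preserve visual evidence diminish in deeper agent turns, amplifying hallucinations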
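
The drop in visual attention described above can be illustrated with a small sketch. It assumes one row of attention weights per agent turn and a mask marking which input positions are vision tokens. The function name `visual_attention_share` and the toy numbers are hypothetical and not taken from the paper.

```python
from typing import Sequence


def visual_attention_share(attn: Sequence[float], is_vision: Sequence[bool]) -> float:
    """Return the fraction of attention mass that lands on vision tokens."""
    if len(attn) != len(is_vision):
        raise ValueError("attn and is_vision must have the same length")
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    vision_mass = sum(a for a, v in zip(attn, is_vision) if v)
    return vision_mass / total


# Toy data, made up for illustration: two vision tokens, with more text
# tokens appended at every agent turn as textual messages accumulate.
turns = [
    ([0.30, 0.30, 0.20, 0.20], [True, True, False, False]),
    ([0.15, 0.15, 0.30, 0.20, 0.20], [True, True, False, False, False]),
    ([0.05, 0.05, 0.30, 0.20, 0.20, 0.20], [True, True, False, False, False, False]),
]
shares = [visual_attention_share(attn, mask) for attn, mask in turns]
print(shares)
```

In this toy setup the share assigned to vision tokens shrinks at each turn, which is the pattern the attention analyses associate with hallucination snowballing.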
Innovation

Methods, ideas, or system contributions that make the work stand out.

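Visual Flow relays inter-agent messages through visual relay tokens instead of text alone
Selects vision tokens with a unimodal attention peak in the middle layers as relay tokens
Applies attention reallocation to amplify this pattern and reduce hallucination snowballing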
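
As a rough sketch of the two ideas listed above, the code below picks tokens whose per-layer attention profile has a single peak in the middle layers, then boosts their attention weights and renormalizes. This is not the authors' implementation. The middle-third window, the boost factor `alpha`, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence


def is_unimodal_mid_peak(profile: Sequence[float], lo: float = 1 / 3, hi: float = 2 / 3) -> bool:
    """True if the profile rises to one peak and then falls, with the peak
    located in the (assumed) middle band of layers, given as relative depth."""
    n = len(profile)
    if n < 3:
        return False
    peak = max(range(n), key=profile.__getitem__)
    rising = all(profile[i] <= profile[i + 1] for i in range(peak))
    falling = all(profile[i] >= profile[i + 1] for i in range(peak, n - 1))
    return rising and falling and lo <= peak / (n - 1) <= hi


def select_relay_tokens(profiles: Sequence[Sequence[float]]) -> List[int]:
    """Indices of tokens whose layer-wise profile passes the check above."""
    return [i for i, p in enumerate(profiles) if is_unimodal_mid_peak(p)]


def reallocate(attn: Sequence[float], relay_idx: Sequence[int], alpha: float = 2.0) -> List[float]:
    """Scale the weights of relay tokens by alpha (hypothetical factor),
    then renormalize so the weights sum to 1."""
    relay = set(relay_idx)
    boosted = [a * alpha if i in relay else a for i, a in enumerate(attn)]
    total = sum(boosted)
    return [b / total for b in boosted]


# Toy per-layer attention profiles for three vision tokens over six layers.
profiles = [
    [0.1, 0.2, 0.5, 0.4, 0.2, 0.1],  # single peak in a middle layer
    [0.5, 0.4, 0.3, 0.2, 0.1, 0.1],  # peak in the first layer
    [0.1, 0.4, 0.2, 0.5, 0.2, 0.1],  # two peaks
]
relay = select_relay_tokens(profiles)
new_attn = reallocate([0.2, 0.3, 0.5], relay)
print(relay, new_attn)
```

Only the first toy token is selected, and after reallocation its share of attention grows while the weights still sum to 1. A real system would apply such a step inside the model's attention computation rather than on a plain list.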