🤖 AI Summary
This work addresses the attenuation of visual attention in large vision-language models during long-sequence generation, where accumulated textual history dilutes the visual signal. To mitigate this, the authors propose Persistent Visual Memory (PVM), a lightweight learnable module that runs as a parallel branch alongside the feed-forward network (FFN) and provides a distance-agnostic pathway for retrieving visual embeddings. This design gives the model sustained, on-demand access to visual information throughout deep generation. Integrated into Qwen3-VL (4B and 8B), PVM improves average accuracy, particularly on complex visual reasoning tasks, counteracts the degradation caused by increasing sequence length, and accelerates the convergence of internal predictions.
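The dilution effect described above can be illustrated with a toy calculation. This is a minimal sketch, not the paper's analysis: it assumes every visual key receives one shared attention logit and every text key another, so the softmax normalizer reduces to two terms. The function name `visual_attention_mass` and the token counts are hypothetical.

```python
import math


def visual_attention_mass(num_visual, num_text, visual_logit=0.0, text_logit=0.0):
    """Share of softmax attention that lands on visual tokens.

    Toy assumption (not from the paper): all visual keys share one logit
    and all text keys share another, so the partition function is the sum
    of a visual term and a text term.
    """
    visual_term = num_visual * math.exp(visual_logit)
    text_term = num_text * math.exp(text_logit)
    return visual_term / (visual_term + text_term)


num_visual = 256  # hypothetical number of image tokens in the prompt
for num_text in (0, 256, 1024, 4096, 16384):
    mass = visual_attention_mass(num_visual, num_text)
    print(f"text tokens={num_text:6d}  visual attention mass={mass:.4f}")
```

With equal logits the mass is N_v / (N_v + T), which behaves like N_v / T once the number of generated text tokens T is much larger than the number of visual tokens N_v. The text term in the normalizer grows with every generated token while the visual term stays fixed.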
📝 Abstract
While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
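The abstract does not spell out PVM's internals, so the following is only a hedged sketch of what a parallel, distance-agnostic retrieval branch could look like. It assumes a softmax lookup over a fixed set of visual embeddings whose output is added to the residual stream beside the FFN. Every name here (`memory_read`, `block_with_memory`, `gate`) is hypothetical, the FFN is replaced by a stand-in ReLU, and nothing is trained; in the actual module the memory and any projections would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def memory_read(hidden, visual_memory):
    """Retrieve a visual vector for one hidden state.

    The softmax runs over visual memory slots only, so the length of the
    text history never enters the normalizer. That is the property this
    sketch uses to illustrate "distance-agnostic" retrieval.
    """
    scale = 1.0 / math.sqrt(len(hidden))
    weights = softmax([dot(hidden, slot) * scale for slot in visual_memory])
    dim = len(visual_memory[0])
    return [sum(w * slot[i] for w, slot in zip(weights, visual_memory))
            for i in range(dim)]


def toy_ffn(hidden):
    # Stand-in for the real feed-forward network: an elementwise ReLU.
    return [max(0.0, h) for h in hidden]


def block_with_memory(hidden, visual_memory, gate=0.1):
    """Residual update with the FFN and the memory branch side by side."""
    ffn_out = toy_ffn(hidden)
    mem_out = memory_read(hidden, visual_memory)
    return [h + f + gate * m for h, f, m in zip(hidden, ffn_out, mem_out)]


# Hypothetical 4-dimensional example with three visual memory slots.
visual_memory = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
hidden = [0.5, -0.2, 0.1, 0.3]
print(block_with_memory(hidden, visual_memory))
```

Because `memory_read` takes only the current hidden state and the visual memory, its output is the same whether the token is the tenth or the ten-thousandth in the generated sequence, in contrast to self-attention, where visual keys compete with a growing set of text keys.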