🤖 AI Summary
This work first systematically exposes privacy leakage risks in multimodal retrieval-augmented generation (MRAG) systems across vision-language and speech-language scenarios. Addressing the gap in existing RAG privacy research—largely confined to text-only modalities—we propose the first cross-modal privacy threat taxonomy for MRAG. We further design the first black-box, compositional, structured prompt attack tailored to multimodal RAG, empirically uncovering two distinct leakage pathways in large multimodal models (LMMs): “direct reproduction” and “semantic inference.” Evaluations across multiple mainstream MRAG systems demonstrate the attack’s effectiveness: private content—including image captions and speech transcriptions—is successfully extracted, with a maximum leakage rate of 78.3%. Our analysis reveals that current MRAG systems widely lack coordinated, cross-modal privacy safeguards. These findings underscore the urgent need to develop robust, privacy-preserving MRAG frameworks capable of mitigating leakage across heterogeneous modalities.
📝 Abstract
Multimodal Retrieval-Augmented Generation (MRAG) systems enhance LMMs by integrating external multimodal databases, but introduce unexplored privacy vulnerabilities. While text-based RAG privacy risks have been studied, multimodal data presents unique challenges. We provide the first systematic analysis of MRAG privacy vulnerabilities across vision-language and speech-language modalities. Using a novel compositional structured prompt attack in a black-box setting, we demonstrate how attackers can extract private information by manipulating queries. Our experiments reveal that LMMs can both directly generate outputs resembling retrieved content and produce descriptions that indirectly expose sensitive information, highlighting the urgent need for robust privacy-preserving MRAG techniques.