🤖 AI Summary
Large vision-language models (LVLMs) exhibit weak zero-shot performance and incur high fine-tuning costs for fake news detection (FND).
Method: We propose IMFND, a framework that injects prior knowledge from a lightweight model (e.g., CLIP) into the LVLM's in-context examples and test inputs in the form of predicted labels and probabilities, guiding the LVLM to attend to high-risk multimodal fragments without fine-tuning. IMFND combines zero-shot inference, standard and enhanced in-context learning (ICL), CLIP-predicted probabilities, and multimodal reasoning with CogVLM or GPT-4V.
Contribution/Results: Evaluated on three public benchmarks, IMFND significantly outperforms standard ICL and achieves accuracy comparable to, or exceeding, that of fine-tuned small models (e.g., BERT). It establishes LVLMs as efficient, training-free multimodal classifiers, demonstrating a new paradigm for zero-shot and in-context FND.
📝 Abstract
Large visual-language models (LVLMs) exhibit exceptional performance in visual-language reasoning across diverse cross-modal benchmarks. Despite these advances, recent research indicates that Large Language Models (LLMs), like GPT-3.5-turbo, underachieve compared to well-trained smaller models, such as BERT, in Fake News Detection (FND), prompting inquiries into LVLMs' efficacy in FND tasks. Although performance could improve through fine-tuning LVLMs, the substantial parameters and requisite pre-trained weights render it a resource-heavy endeavor for FND applications. This paper initially assesses the FND capabilities of two notable LVLMs, CogVLM and GPT-4V, in comparison to a smaller yet adeptly trained CLIP model in a zero-shot context. The findings demonstrate that LVLMs can attain performance competitive with that of the smaller model. Next, we integrate standard in-context learning (ICL) with LVLMs, noting improvements in FND performance, though limited in scope and consistency. To address this, we introduce the In-context Multimodal Fake News Detection (IMFND) framework, enriching in-context examples and test inputs with predictions and corresponding probabilities from a well-trained smaller model. This strategic integration directs the LVLMs' focus towards news segments associated with higher probabilities, thereby improving their analytical accuracy. The experimental results suggest that the IMFND framework significantly boosts the FND efficiency of LVLMs, achieving enhanced accuracy over the standard ICL approach across three publicly available FND datasets.
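The core idea of enriching in-context examples and the test input with a smaller model's prediction and probability can be illustrated with a minimal prompt-building sketch. This is not the authors' implementation: the `NewsItem` type, the `format_item` and `build_prompt` helpers, and the prompt wording are all hypothetical, and the smaller model's outputs are assumed to have been computed beforehand. Passing the resulting prompt and images to an LVLM is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NewsItem:
    """One multimodal news item plus the smaller model's output (hypothetical type)."""
    text: str
    image_path: str
    small_model_label: str        # e.g. "fake" or "real", from the smaller model
    small_model_prob: float       # probability the smaller model assigns to that label
    gold_label: Optional[str] = None  # known only for in-context examples


def format_item(item: NewsItem, with_answer: bool) -> str:
    """Render one item; the wording here is an assumed template, not the paper's."""
    lines = [
        f"News text: {item.text}",
        f"Image: <{item.image_path}>",
        "Smaller model prediction: "
        f"{item.small_model_label} (probability {item.small_model_prob:.2f})",
    ]
    if with_answer:
        if item.gold_label is None:
            raise ValueError("in-context examples need a gold label")
        lines.append(f"Answer: {item.gold_label}")
    else:
        # The test input ends with an open slot for the LVLM to complete.
        lines.append("Answer:")
    return "\n".join(lines)


def build_prompt(examples: List[NewsItem], query: NewsItem) -> str:
    """Concatenate enriched in-context examples followed by the enriched test input."""
    header = (
        "Decide whether each news item is fake or real. "
        "A smaller model's prediction and probability are given as a hint."
    )
    blocks = [format_item(e, with_answer=True) for e in examples]
    blocks.append(format_item(query, with_answer=False))
    return header + "\n\n" + "\n\n".join(blocks)
```

Dropping the "Smaller model prediction" line from `format_item` would reduce this sketch to a standard ICL prompt, which is the baseline the abstract compares against.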