🤖 AI Summary
To address the large literal-to-metaphorical semantic gap between visual and textual modalities in internet memes and the high computational cost of generative approaches, this paper proposes a lightweight and efficient multimodal metaphor recognition framework. Methodologically, it integrates CLIP-based cross-modal encoding, prompt-guided feature fusion, and a novel Spherical Linear Interpolation (SLERP)-based concept drift mechanism operating on CLIP embeddings to explicitly model the semantic evolution from literal to metaphorical interpretations. Additionally, it introduces an adapter-style LayerNorm fine-tuning strategy that optimizes only normalization layer parameters, drastically reducing training overhead. Evaluated on the MET-Meme benchmark, the method achieves state-of-the-art performance while reducing training FLOPs by an order of magnitude compared to mainstream generative models. Ablation studies confirm statistically significant improvements attributable to both innovations.
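The core of the concept-drift mechanism is Spherical Linear Interpolation between two CLIP embeddings. A minimal sketch of SLERP follows; the paper's exact drift formulation (which embeddings are mixed and with what interpolation weight `t`) is not specified here, so the function below is only the generic operation:

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between embeddings a and b.

    Both vectors are normalized to the unit sphere; the result is a unit
    vector that travels along the great circle from a (t=0) to b (t=1).
    """
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    dot = np.clip(np.dot(a, b), -1.0, 1.0)
    theta = np.arccos(dot)  # angle between the two embeddings
    if theta < 1e-6:
        # Nearly parallel vectors: fall back to linear interpolation
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)
```

Unlike plain linear interpolation, SLERP keeps the interpolated embedding on the unit hypersphere where CLIP embeddings are compared, which is presumably why the authors prefer it for generating the drifted concept.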
📝 Abstract
Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding of linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces **C**oncept **D**rift **G**uided **L**ayerNorm **T**uning (**CDGLT**), a novel, training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that applies Spherical Linear Interpolation (SLERP) to cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding; this drifted concept helps close the gap between literal features and the figurative task. (2) A prompt construction strategy that adapts pre-trained language models' feature extraction and fusion to the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available at https://github.com/Qianvenh/CDGLT.
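The training-efficiency claim rests on LayerNorm tuning: freezing the backbone and updating only the normalization layers' affine parameters. A minimal PyTorch sketch of that idea is below; the paper's exact recipe (which normalization layers of CLIP are unfrozen, and any adapter-specific details) may differ:

```python
import torch.nn as nn

def apply_ln_tuning(model: nn.Module) -> nn.Module:
    """Freeze all parameters except LayerNorm affine weights and biases.

    This is a generic sketch of LN tuning: every parameter owned directly
    by an nn.LayerNorm module stays trainable; everything else is frozen.
    """
    for module in model.modules():
        trainable = isinstance(module, nn.LayerNorm)
        for param in module.parameters(recurse=False):
            param.requires_grad = trainable
    return model
```

Because LayerNorm parameters are a tiny fraction of a transformer's weights, the optimizer state and gradient computation for the trainable set shrink accordingly, which is the source of the reduced training overhead.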