🤖 AI Summary
This work addresses the low-resource challenges of multimodal hateful meme detection in Bengali: data scarcity, class imbalance, and pervasive code-mixing. The authors first augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from MIMOSA to improve class balance and semantic diversity. They then propose xDORA, an enhanced dual co-attention framework that fuses vision encoders (CLIP, DINOv2) with multilingual text encoders (XLM-R, XGLM) via weighted attention pooling, and extend it in two directions: a FAISS-based k-nearest-neighbor classifier for non-parametric inference, and RAG-Fused DORA, which adds retrieval-augmented contextual reasoning. On the extended dataset, RAG-Fused DORA achieves macro-average F1 scores of 0.79 for hateful meme identification and 0.74 for target entity detection, outperforming the DORA baseline and showing robustness on rare classes.
📝 Abstract
Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semantic diversity. We propose the Enhanced Dual Co-attention Framework (xDORA), integrating vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling to learn robust cross-modal representations. Building on these embeddings, we develop a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduce RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. We further evaluate LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings. Experiments on the extended dataset show that xDORA (CLIP + XLM-R) achieves macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, while RAG-Fused DORA improves performance to 0.79 and 0.74, yielding gains over the DORA baseline. The FAISS-based classifier performs competitively and demonstrates robustness for rare classes through semantic similarity modeling. In contrast, LLaVA exhibits limited effectiveness in few-shot settings, with only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning. These findings demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.
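The non-parametric inference idea behind the FAISS-based classifier can be sketched in a few lines: each meme is reduced to a fused embedding, and a query is labeled by majority vote over its nearest training embeddings. The sketch below uses exact cosine similarity in NumPy rather than a FAISS index (FAISS's `IndexFlatIP` on L2-normalised vectors yields the same neighbors, just at scale); the embeddings and labels are hypothetical placeholders, not the paper's actual data or pipeline.

```python
import numpy as np

def knn_predict(query_emb, index_embs, index_labels, k=5):
    """Classify a query embedding by majority vote over its k nearest
    training embeddings under cosine similarity. A FAISS IndexFlatIP
    over L2-normalised vectors returns the same neighbors efficiently."""
    # L2-normalise so that the inner product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    X = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = X @ q                       # similarity of query to every item
    top_k = np.argsort(-sims)[:k]      # indices of the k most similar items
    votes = index_labels[top_k]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]   # majority label among neighbors

# Hypothetical fused image+text embeddings for six training memes
rng = np.random.default_rng(0)
embs = rng.normal(size=(6, 8))
labels = np.array([0, 0, 0, 1, 1, 1])  # 0 = not hateful, 1 = hateful

# A query very close to training meme 3 should inherit its label
query = embs[3] + 0.01 * rng.normal(size=8)
pred = knn_predict(query, embs, labels, k=1)
```

Because the decision is a vote over stored examples rather than a learned boundary, a rare class can still be predicted as long as a few semantically similar examples exist in the index, which is the robustness property the abstract reports.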