🤖 AI Summary
This work addresses the low-resource challenges of multimodal hateful meme detection in Bengali: data scarcity, class imbalance, and pervasive code-mixing. The authors first augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from MIMOSA to improve class balance and semantic diversity. They then propose xDORA, an enhanced dual co-attention framework that fuses vision encoders (CLIP, DINOv2) with multilingual text encoders (XLM-R, XGLM) via weighted attention pooling, and extend it in two directions: a FAISS-based k-nearest-neighbor classifier for non-parametric inference, and RAG-Fused DORA, which adds retrieval-augmented contextual reasoning. On the extended dataset, RAG-Fused DORA achieves macro-average F1 scores of 0.79 for hateful meme identification and 0.74 for target entity detection, outperforming the DORA baseline and showing robustness on rare classes.
📝 Abstract
Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semantic diversity. We propose the Enhanced Dual Co-attention Framework (xDORA), integrating vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling to learn robust cross-modal representations. Building on these embeddings, we develop a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduce RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. We further evaluate LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings. Experiments on the extended dataset show that xDORA (CLIP + XLM-R) achieves macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, while RAG-Fused DORA improves performance to 0.79 and 0.74, yielding gains over the DORA baseline. The FAISS-based classifier performs competitively and demonstrates robustness for rare classes through semantic similarity modeling. In contrast, LLaVA exhibits limited effectiveness in few-shot settings, with only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning. These findings demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.
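The non-parametric inference idea behind the FAISS-based classifier can be sketched in a few lines: each meme is reduced to a fused embedding, and a query is labeled by majority vote over its nearest training embeddings. The sketch below uses exact cosine similarity in NumPy rather than a FAISS index (FAISS's `IndexFlatIP` on L2-normalised vectors yields the same neighbors, just at scale); the embeddings and labels are hypothetical placeholders, not the paper's actual data or pipeline.

```python
import numpy as np

def knn_predict(query_emb, index_embs, index_labels, k=5):
    """Classify a query embedding by majority vote over its k nearest
    training embeddings under cosine similarity. A FAISS IndexFlatIP
    over L2-normalised vectors returns the same neighbors efficiently."""
    # L2-normalise so that the inner product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    X = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = X @ q                       # similarity of query to every item
    top_k = np.argsort(-sims)[:k]      # indices of the k most similar items
    votes = index_labels[top_k]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]   # majority label among neighbors

# Hypothetical fused image+text embeddings for six training memes
rng = np.random.default_rng(0)
embs = rng.normal(size=(6, 8))
labels = np.array([0, 0, 0, 1, 1, 1])  # 0 = not hateful, 1 = hateful

# A query very close to training meme 3 should inherit its label
query = embs[3] + 0.01 * rng.normal(size=8)
pred = knn_predict(query, embs, labels, k=1)
```

Because the decision is a vote over stored examples rather than a learned boundary, a rare class can still be predicted as long as a few semantically similar examples exist in the index, which is the robustness property the abstract reports.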