🤖 AI Summary
This paper identifies a pervasive textual modality bias in multimodal intent detection: mainstream datasets over-rely on textual cues, so text-only large language models (e.g., Mistral-7B) significantly outperform multimodal models, which distorts the assessment of true multimodal capabilities.
Method: We propose the first data debiasing framework specifically for intent detection, which systematically identifies and attenuates text-dominant cues to construct a more balanced evaluation benchmark.
Contribution/Results: Empirical analysis reveals that on original data, text-only models achieve up to 23.6% higher accuracy than multimodal models; after debiasing, all models suffer substantial performance drops (average decline of 18.4%), confirming that text bias severely masks multimodal models’ latent potential. This work challenges prevailing evaluation paradigms and establishes a methodological foundation and a new, fairer benchmark for trustworthy multimodal intent understanding research.
📝 Abstract
The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multimodal models, on the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and approximately 4% on MIntRec2.0. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We also confirm this modality bias through human evaluation. Next, we propose a framework to debias the datasets. Debiasing removes more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0, resulting in significant performance degradation across all models; smaller multimodal fusion models are the most affected, with an accuracy drop of over 50-60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
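The abstract does not spell out how the debiasing framework decides which samples to remove. The following is only a minimal sketch of one plausible filtering rule, assuming a sample counts as text-biased when a text-only predictor already classifies it correctly. The function names (`debias`, `removal_rate`, `keyword_predictor`), the toy samples, and the intent labels are hypothetical illustrations, not the authors' method or data.

```python
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, str]
TextPredictor = Callable[[str], str]


def debias(samples: Sequence[Sample],
           text_predictors: Sequence[TextPredictor]) -> List[Sample]:
    """Keep only samples that no text-only predictor classifies correctly.

    Assumption: a sample solvable from text alone is treated as text-biased.
    """
    kept = []
    for sample in samples:
        solved_by_text = any(
            predict(sample["text"]) == sample["label"]
            for predict in text_predictors
        )
        if not solved_by_text:
            kept.append(sample)
    return kept


def removal_rate(original: Sequence[Sample], kept: Sequence[Sample]) -> float:
    """Fraction of the original samples removed by debiasing."""
    return 1 - len(kept) / len(original)


def keyword_predictor(text: str) -> str:
    """Hypothetical stand-in for a text-only model such as a fine-tuned LLM."""
    if "thanks" in text:
        return "thank"
    if "sorry" in text:
        return "apologise"
    if "great" in text:
        return "praise"
    return "inform"


# Toy data: the last two intents depend on tone or facial expression,
# so the text-only predictor gets them wrong and they survive debiasing.
samples = [
    {"text": "thanks a lot", "label": "thank"},
    {"text": "sorry about that", "label": "apologise"},
    {"text": "oh great", "label": "taunt"},
    {"text": "fine", "label": "complain"},
]

kept = debias(samples, [keyword_predictor])
rate = removal_rate(samples, kept)
```

In this toy run, half of the samples are removed because the text-only predictor solves them. A real pipeline would presumably replace `keyword_predictor` with trained text-only models and could add per-modality checks for audio and video.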