Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a pervasive text-modality bias in multimodal intent detection: mainstream datasets over-rely on textual cues, so a purely textual large language model (Mistral-7B) outperforms most competitive multimodal models, by roughly 9% on MIntRec-1 and 4% on MIntRec2.0, distorting assessments of true multimodal capability. Method: the authors propose the first data-debiasing framework for intent detection, which systematically identifies and removes text-dominant samples to construct a more balanced evaluation benchmark. Contribution/Results: empirical and human analysis shows that over 90% of samples can be classified correctly from textual input, alone or combined with other modalities; after debiasing, more than 70% of MIntRec-1 and more than 50% of MIntRec2.0 samples are removed, and all models degrade substantially, with smaller multimodal fusion models the most affected (accuracy drops of 50-60%). This work challenges prevailing evaluation paradigms and establishes a methodological foundation for fairer, more trustworthy multimodal intent understanding research.

📝 Abstract
The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multimodal models, on the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on the MIntRec-1 dataset and 4% on MIntRec2.0. This performance advantage stems from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We further confirm this modality bias via human evaluation. Next, we propose a framework to debias the datasets; upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 are removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models the most affected, suffering accuracy drops of 50-60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
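The debiasing idea described in the abstract, removing samples whose intent is recoverable from text alone, can be sketched as a simple filter. This is a minimal illustration, not the paper's actual pipeline: the `text_only_predict` callable stands in for whatever text-only model (e.g., Mistral-7B) and decision criterion the authors actually use.

```python
def debias(samples, text_only_predict):
    """Return the subset of samples NOT solvable from text alone.

    samples: list of dicts with "text" and "label" keys.
    text_only_predict: callable mapping a text string to a predicted label.
    """
    kept = []
    for sample in samples:
        # If the text-only model already predicts the intent correctly,
        # the sample is text-dominant and gets dropped from the benchmark.
        if text_only_predict(sample["text"]) != sample["label"]:
            kept.append(sample)
    return kept


# Toy usage with a keyword-based stand-in predictor (hypothetical labels).
toy_data = [
    {"text": "thanks so much!", "label": "thank"},
    {"text": "hmm", "label": "complain"},  # needs tone/video to resolve
]
toy_predict = lambda t: "thank" if "thanks" in t else "agree"

print(debias(toy_data, toy_predict))  # only the ambiguous sample survives
```

Under this sketch, the surviving samples are exactly those where non-text modalities carry necessary signal, which matches the abstract's report that 70%+ of MIntRec-1 samples are removed.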
Problem

Research questions and friction points this paper is trying to address.

Investigating modality bias in multimodal intent detection datasets
Evaluating the performance of text-only versus multimodal models on biased data
Proposing a framework to debias datasets and assess true multimodal effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

A text-only LLM outperforms multimodal models
Proposes a framework to debias multimodal datasets
Analyzes the context-specific relevance of modalities