Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) face significant bottlenecks in cognitive-level multimodal semantic understanding (e.g., intent, emotion, and dialogue acts), and the field lacks systematic benchmarks for evaluating this capability. To address this, we introduce MMLA, the first benchmark dedicated to cognitive semantic understanding in multimodal language analysis, comprising over 61K multimodal utterances from staged and real-world scenarios and spanning six core semantic dimensions. We design a unified evaluation framework that supports zero-shot inference, supervised fine-tuning, and instruction tuning, enabling fair comparison across paradigms. Extensive experiments on eight mainstream branches of LLMs and MLLMs examine cross-modal alignment, semantic disentanglement, and multi-task joint modeling. Even the best fine-tuned models achieve only 60–70% accuracy across the six tasks, exposing fundamental limitations in deep semantic comprehension. All data and code are publicly released.
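Below is a minimal sketch of the zero-shot inference protocol the summary describes: prompt a model with an utterance plus the task's closed label set, then score exact-match accuracy. The label subset, prompt wording, and the `query_mllm` callable are illustrative assumptions, not the paper's exact setup, and the video/audio inputs an MLLM would also receive are omitted for brevity.

```python
# Hypothetical zero-shot evaluation loop for one MMLA task (intent).
# `query_mllm` stands in for any model-specific inference call.
from typing import Callable

INTENT_LABELS = ["complain", "praise", "apologize", "ask for help"]  # illustrative subset

def zero_shot_prompt(utterance: str, labels: list[str]) -> str:
    # Constrain the model to answer with one label from the closed set.
    return (
        f'Given the speaker\'s utterance: "{utterance}"\n'
        f"Choose the intent from: {', '.join(labels)}.\n"
        "Answer with exactly one label."
    )

def evaluate(samples: list[tuple[str, str]], query_mllm: Callable[[str], str]) -> float:
    # Exact-match accuracy over (utterance, gold_label) pairs.
    correct = 0
    for utterance, gold in samples:
        pred = query_mllm(zero_shot_prompt(utterance, INTENT_LABELS)).strip().lower()
        correct += int(pred == gold.lower())
    return correct / max(len(samples), 1)

if __name__ == "__main__":
    stub = lambda prompt: "complain"  # replace with a real model call
    print(evaluate([("This is the third time my order has arrived late!", "complain")], stub))
```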

📝 Abstract
Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%–70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.
Problem

Research questions and friction points this paper is trying to address.

Assessing MLLMs' ability to understand cognitive-level semantics in multimodal language.
Evaluating LLMs and MLLMs on six core dimensions of multimodal semantics.
Exposing the limitations of current MLLMs in understanding complex human language.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced MMLA, the first comprehensive benchmark for cognitive-level multimodal language analysis, covering six semantic dimensions
Evaluated eight mainstream branches of LLMs and MLLMs under zero-shot inference, supervised fine-tuning, and instruction tuning (a fine-tuning sketch follows this list)
Open-sourced all datasets and code
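As referenced above, one plausible way to realize the supervised fine-tuning paradigm is parameter-efficient adaptation with LoRA via Hugging Face `transformers` and `peft`. The backbone name, rank, and target modules below are assumptions for illustration, not the paper's exact recipe.

```python
# A hedged sketch of LoRA-based supervised fine-tuning; all hyperparameters are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

backbone = "meta-llama/Llama-2-7b-hf"  # placeholder; MMLA evaluates several backbones
tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForCausalLM.from_pretrained(backbone)

lora = LoraConfig(
    r=8,                                  # low-rank adapter dimension (assumed)
    lora_alpha=16,                        # adapter scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # freeze the backbone, train only the adapters
model.print_trainable_parameters()   # typically well under 1% of weights are trainable
```

Training would then run a standard causal-LM loss over instruction-response pairs built from the labeled utterances, one pair per example.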
👥 Authors
Hanlei Zhang
Department of Computer Science and Technology, Tsinghua University
Zhuohang Li
Vanderbilt University
Yeshuang Zhu
WeChat - Basic Architecture Dept., Tencent Inc.
natural language processing · image/video generation · human-computer interaction
Hua Xu
Department of Computer Science and Technology, Tsinghua University
Peiwu Wang
Department of Computer Science and Technology, Tsinghua University
Jinchao Zhang
WeChat AI - Pattern Recognition Center
Deep Learning · Natural Language Processing · Machine Translation · Dialogue System
Jie Zhou
Pattern Recognition Center, WeChat AI, Tencent Inc, China
Haige Zhu
Kennesaw State University