🤖 AI Summary
Existing benchmarks inadequately evaluate the joint multilingual and multimodal capabilities of Multimodal Large Language Models (MLLMs): they are typically restricted to English, single-modality inputs, and short contexts, and they lack human-annotated ground truth. To address this, we introduce MCIF, the first human-annotated instruction-following benchmark for multilingual multimodal evaluation, grounded in authentic scientific talks. MCIF covers four languages (English, German, Italian, Chinese) and three modalities (speech, vision, text), pairing instructions with both short- and long-form contexts and thereby uniting multilingualism, multimodality, and long-context reasoning in a single dataset. Released under the CC-BY 4.0 license, MCIF provides a standardized, crosslingual, and generalizable evaluation platform for MLLMs, bridging critical gaps in assessing linguistic diversity and modality fusion.
📝 Abstract
Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, focus on a single modality at a time, rely on short-form contexts, or lack human annotations--hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities--speech, vision, and text--and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs' abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLM development.
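To make the benchmark's design concrete, below is a minimal sketch of how a single MCIF evaluation item could be represented, given the dimensions named in the abstract (four instruction languages, three context modalities, short- vs. long-form inputs, human-annotated references). The field names and record layout are illustrative assumptions, not the released data schema.

```python
# Hypothetical sketch of one MCIF evaluation item. Field names are
# illustrative assumptions and do not reflect the released dataset schema.
from dataclasses import dataclass
from typing import Literal, Optional

Language = Literal["en", "de", "it", "zh"]   # English, German, Italian, Chinese
Modality = Literal["speech", "vision", "text"]

@dataclass
class MCIFSample:
    talk_id: str                       # identifies the source scientific talk
    instruction: str                   # instruction text, in one of the four languages
    instruction_lang: Language
    context_modality: Modality         # which modality carries the context
    context: str                       # e.g. audio/video path or transcript text
    context_span: Literal["short", "long"]  # short- vs. long-form input
    reference: Optional[str] = None    # human-annotated ground truth, if provided

# Example: a German instruction over the long-form speech track of a talk.
sample = MCIFSample(
    talk_id="talk_0001",
    instruction="Fasse den Vortrag in drei Sätzen zusammen.",
    instruction_lang="de",
    context_modality="speech",
    context="audio/talk_0001.wav",
    context_span="long",
    reference="(human-written reference summary)",
)
print(sample.instruction_lang, sample.context_modality, sample.context_span)
```

Under this view, evaluating a model amounts to iterating over such records, feeding each instruction together with its modality-specific context to the MLLM, and scoring the output against the human reference.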