🤖 AI Summary
Existing multimodal large language model (MLLM) evaluation benchmarks predominantly focus on English text or images, lacking systematic assessment of multilingual speech understanding and factual reasoning capabilities. To address this gap, we introduce CCFQA—the first cross-lingual, cross-modal factual evaluation benchmark—comprising parallel speech-text fact-based question-answering data across eight languages. Leveraging CCFQA, we empirically uncover pervasive hallucination issues in mainstream MLLMs for multilingual spoken QA. We further propose a few-shot transfer strategy that effectively adapts English factual reasoning capabilities to multilingual speech understanding using only five demonstration examples. Experimental results demonstrate that our method achieves performance on par with GPT-4o-mini-Audio and substantially outperforms baseline models. The code and dataset are publicly released.
📝 Abstract
As Large Language Models (LLMs) are increasingly adopted across the multilingual world, ensuring hallucination-free factuality becomes increasingly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, leaving a gap in the evaluation of multilingual input, especially speech. To bridge this gap, we propose a novel **C**ross-lingual and **C**ross-modal **F**actuality benchmark (**CCFQA**). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs' cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at https://github.com/yxduir/ccfqa.
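To make the few-shot idea concrete, here is a minimal sketch of how five English QA demonstrations might be prepended to a target question before it is passed to a model. The demonstration pairs and the prompt template are illustrative assumptions, not the paper's actual data or format, and the target question is shown in text form for simplicity (the benchmark itself pairs questions with speech).

```python
# Hypothetical sketch: build a 5-shot prompt that pairs English factual QA
# demonstrations with a target question in another language.
# The demos and template below are assumptions for illustration only.

ENGLISH_DEMOS = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote 'Romeo and Juliet'?", "William Shakespeare"),
    ("What is the chemical symbol for gold?", "Au"),
    ("Which planet is known as the Red Planet?", "Mars"),
    ("In what year did World War II end?", "1945"),
]

def build_few_shot_prompt(question: str, demos=ENGLISH_DEMOS) -> str:
    """Prepend five English QA demonstrations to a target question
    (e.g. the transcript of a spoken question in another language)."""
    lines = ["Answer each factual question concisely."]
    for q, a in demos:
        lines.append(f"Q: {q}\nA: {a}")
    # Leave the final answer slot empty for the model to complete.
    lines.append(f"Q: {question}\nA:")
    return "\n\n".join(lines)

prompt = build_few_shot_prompt("¿Cuál es la capital de España?")
print(prompt.count("Q:"))  # 6: five demonstrations plus the target question
```

The same scaffolding would apply when the target question arrives as audio: the demonstrations stay in English text while the final turn carries the speech input.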