HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

📅 2025-06-04
🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks focus predominantly on STEM domains, neglecting the interdisciplinary lateral reasoning and the alignment of abstract concepts with visual representations that are essential for the humanities and social sciences (HSS). Method: We introduce HSSBench, the first multilingual (the UN's six official languages), HSS-oriented multimodal benchmark, built around an HSS-specific evaluation paradigm spanning six core competency dimensions. Our data generation pipeline integrates expert-AI collaborative curation, iterative refinement via AI agents, multilingual prompt engineering, cross-modal alignment assessment, and fine-grained human verification, yielding 13,000+ high-quality samples. Contribution/Results: Empirical evaluation reveals that even state-of-the-art MLLMs achieve only 41.7% average accuracy on HSS tasks, exposing critical deficits in interdisciplinary reasoning. The benchmark establishes foundational infrastructure and a standardized reference for developing and evaluating HSS-aware multimodal models.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' ability in Humanities and Social Sciences
Addressing lack of interdisciplinary thinking in current benchmarks
Challenging MLLMs to link abstract concepts with visual representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual benchmark for HSS tasks
Expert-agent collaborative data generation
Over 13,000 interdisciplinary samples
Zhaolu Kang
TeleAI, China Telecom
Junhao Gong
Peking University
Jiaxu Yan
Chinese Academy of Sciences
Wanke Xia
Tsinghua University
Yian Wang
Chinese Academy of Sciences
Ziwen Wang
University of Illinois Urbana-Champaign; New York University
Huaxuan Ding
Peking University
Zhuo Cheng
CMU
Wenhao Cao
Renmin University of China
Zhiyuan Feng
Tsinghua University
Siqi He
Peking University
Shannan Yan
Tsinghua University
Junzhe Chen
Tsinghua University
Xiaomin He
Columbia University
Chaoya Jiang
Shandong University
Wei Ye
Peking University
Kaidong Yu
TeleAI, China Telecom
Xuelong Li
TeleAI, China Telecom