HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

📅 2025-06-04
🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks focus predominantly on STEM domains, neglecting the interdisciplinary lateral reasoning and the alignment of abstract concepts with visual representations that are essential for the humanities and social sciences (HSS). Method: We introduce HSSBench, the first multilingual (the UN's six official languages), HSS-oriented multimodal benchmark, built around an HSS-specific evaluation paradigm spanning six core competency dimensions. Our data generation pipeline integrates expert-AI collaborative curation, iterative refinement via AI agents, multilingual prompt engineering, cross-modal alignment assessment, and fine-grained human verification, yielding 13,000+ high-quality samples. Contribution/Results: Empirical evaluation reveals that even state-of-the-art MLLMs achieve only 41.7% average accuracy on HSS tasks, exposing critical deficits in interdisciplinary reasoning. The benchmark establishes foundational infrastructure and a standardized reference for developing and evaluating HSS-aware multimodal models.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' ability in Humanities and Social Sciences
Addressing lack of interdisciplinary thinking in current benchmarks
Challenging MLLMs to link abstract concepts with visual representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual benchmark for HSS tasks
Expert-agent collaborative data generation
Over 13,000 interdisciplinary samples
Zhaolu Kang
TeleAI, China Telecom
Junhao Gong
Peking University
Jiaxu Yan
Chinese Academy of Sciences
Wanke Xia
Tsinghua University
Yian Wang
Chinese Academy of Sciences
Ziwen Wang
University of Illinois Urbana-Champaign; New York University
Huaxuan Ding
Peking University
Zhuo Cheng
CMU
Wenhao Cao
Renmin University of China
Zhiyuan Feng
Tsinghua University
Siqi He
Peking University
Shannan Yan
Tsinghua University
Junzhe Chen
Tsinghua University
Xiaomin He
Columbia University
Chaoya Jiang
Shandong University
Wei Ye
Peking University
Kaidong Yu
TeleAI, China Telecom
Xuelong Li
TeleAI, China Telecom