InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) lack robust capabilities for sustained, interleaved multi-round multimodal interaction, hindering their ability to comprehend entangled cross-modal contexts and generate coherent, context-aware responses. Method: We propose InterMT—a first-of-its-kind preference dataset grounded in real human feedback, comprising 15.6k prompts, 52.6k dialogue instances, and 32.4k preference pairs—accompanied by a novel nine-dimensional global+local preference annotation protocol. We further introduce a tool-augmented agent workflow to automatically generate high-quality multi-round QA, establish an interleaved multimodal context modeling framework, and release the benchmark InterMT-Bench. Contribution/Results: Experiments reveal that judge model performance follows a power-law improvement with increasing dialogue rounds. Our open-sourced dataset and methodology significantly enhance MLLM alignment with human intent in multi-turn interactive settings.

📝 Abstract
As multimodal large language models (MLLMs) continue to advance across challenging tasks, a key question emerges: What essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment -- not limited to language, but also involving multimodal understanding and generation. To move closer to human-level intelligence, models must similarly support multi-turn, multimodal interaction. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges. In this work, we present an initial exploration through InterMT -- the first preference dataset for multi-turn multimodal interaction, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. InterMT captures human preferences at both global and local levels across nine sub-dimensions, and comprises 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for current models' limited capabilities in multimodal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances. To further this goal, we introduce InterMT-Bench to assess the ability of MLLMs to assist judges in multi-turn, multimodal tasks. We demonstrate the utility of InterMT through applications such as judge moderation and further reveal a multi-turn scaling law for judge models. We hope that open-sourcing our data will facilitate further research on aligning current MLLMs toward the next step. Our project website can be found at https://pku-intermt.github.io .
Problem

Research questions and friction points this paper is trying to address.

Lack of multi-turn multimodal interaction in MLLMs
Need for human-aligned preference datasets in multimodal contexts
Absence of tools to assess MLLMs' multi-turn multimodal capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn multimodal interaction dataset
Tool-augmented MLLMs workflow
Human oversight with expert annotations
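To make the dataset's structure concrete, here is a minimal sketch of what a multi-turn interleaved preference record might look like. The paper specifies nine preference sub-dimensions split across global (conversation-level) and local (turn-level) annotations, but the exact field names, dimension labels, and aggregation rule below are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Turn:
    role: str                                            # "user" or "assistant"
    text: str                                            # textual content of the turn
    image_refs: List[str] = field(default_factory=list)  # interleaved image references

@dataclass
class PreferencePair:
    """One human-labeled comparison between two multi-turn dialogues.

    Dimension names ("helpfulness", "image_text_coherence", ...) are
    placeholders; the paper's nine sub-dimensions may differ.
    """
    prompt_id: str
    dialogue_a: List[Turn]
    dialogue_b: List[Turn]
    # Global (whole-conversation) scores: dimension -> {"a": score, "b": score}
    global_scores: Dict[str, Dict[str, float]]
    # Local (per-turn) scores: turn index -> dimension -> {"a": score, "b": score}
    local_scores: Dict[int, Dict[str, Dict[str, float]]]
    preferred: str                                       # aggregated human label, "a" or "b"

def aggregate_preference(pair: PreferencePair) -> str:
    """Toy aggregation: majority vote over the global dimensions.

    The actual annotation protocol is more nuanced; this only shows how
    per-dimension scores could collapse into a single pairwise label.
    """
    votes_a = sum(1 for s in pair.global_scores.values() if s["a"] > s["b"])
    votes_b = sum(1 for s in pair.global_scores.values() if s["b"] > s["a"])
    return "a" if votes_a >= votes_b else "b"
```

A record in this shape supports both conversation-level reward modeling (via `global_scores`) and turn-level credit assignment (via `local_scores`), which is the distinction the global+local protocol is drawing.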
Authors

Boyuan Chen (Peking University)
Donghai Hong (Peking University)
Jiaming Ji (Peking University)
Jiacheng Zheng (Hong Kong University of Science and Technology)
Bowen Dong (Peking University)
Jiayi Zhou (Peking University)
Kaile Wang (Peking University)
Juntao Dai (Peking University)
Xuyao Wang (Peking University)
Wenqi Chen (Peking University)
Qirui Zheng (Peking University)
Wenxin Li (Peking University)
Sirui Han (The Hong Kong University of Science and Technology)
Yike Guo (Hong Kong University of Science and Technology)
Yaodong Yang (Peking University)