🤖 AI Summary
Existing benchmarks inadequately evaluate the human–AI interaction intelligence of large multimodal models (LMMs), particularly their ability to dynamically revise outputs in response to human feedback.
Method: We propose InterFeedback—the first autonomous evaluation framework for interaction intelligence—featuring an interactive evaluation paradigm, an automated assessment system generalizable across arbitrary LMMs, the dual-modal benchmark InterFeedback-Bench, and the human-validated set InterFeedback-Human. Our methodology integrates interactive prompt engineering, feedback-response modeling, multi-turn trajectory analysis, and cross-dataset consistency evaluation.
Contribution/Results: Experiments reveal that state-of-the-art LMMs, including OpenAI-o1, achieve a success rate below 50% when revising their outputs based on human feedback, exposing a critical bottleneck in interaction intelligence. InterFeedback establishes a quantifiable paradigm and a foundational toolkit for rigorously assessing and iteratively improving LMMs' interactive capabilities.
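The summary above describes the interactive protocol only in prose; a minimal sketch may help make it concrete. The Python below is an assumption-laden illustration, not the authors' released code: `query_lmm` and `give_feedback` are hypothetical stand-ins for the model under test and the feedback provider (a judge model in InterFeedback-Bench, a human in InterFeedback-Human), and the returned ratio mirrors a "corrected via feedback" success rate like the one discussed above.

```python
# Minimal sketch of an InterFeedback-style interactive evaluation loop.
# Hypothetical illustration only: `query_lmm` and `give_feedback` stand in
# for the model under test and the feedback provider; the real framework's
# prompts, feedback format, and scoring may differ.
from typing import Callable

def evaluate_interactively(
    problems: list[dict],                        # each: {"question", "ground_truth"}
    query_lmm: Callable[[str, list[str]], str],  # (question, feedback history) -> answer
    give_feedback: Callable[[str, str], str],    # (wrong answer, ground truth) -> feedback
    max_rounds: int = 3,
) -> float:
    """Return the fraction of initially wrong answers fixed via feedback."""
    corrected, initially_wrong = 0, 0
    for p in problems:
        history: list[str] = []
        answer = query_lmm(p["question"], history)
        if answer == p["ground_truth"]:
            continue  # only feedback behavior on initially wrong answers is scored
        initially_wrong += 1
        for _ in range(max_rounds):
            history.append(give_feedback(answer, p["ground_truth"]))
            answer = query_lmm(p["question"], history)
            if answer == p["ground_truth"]:
                corrected += 1
                break
    return corrected / max(initially_wrong, 1)
```

Under this reading, a return value below 0.5 corresponds to the finding above: the model fails to recover the correct answer within the feedback budget on more than half of its initial errors.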
📝 Abstract
Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework that can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench, which evaluates interactive intelligence on two representative datasets, MMMU-Pro and MathVerse, across 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-3.5-Sonnet. Our evaluation results show that even state-of-the-art LMMs, such as OpenAI-o1, correct their results through human feedback in fewer than 50% of cases. Our findings point to the need for methods that enhance LMMs' capability to interpret and benefit from feedback.