🤖 AI Summary
Existing benchmarks do not capture how well multimodal large language models (MLLMs) handle proactive reasoning interleaved with reactive queries in dynamic, multi-turn interactions over continuous visual streams. To address this gap, this work introduces IPIBench, the first benchmark for Interactive Proactive Intelligence over streaming video, which evaluates multi-turn reactive and proactive behaviors together rather than in isolation. The authors also propose IPI-Agent, a training-free framework that combines a temporal-gating mechanism with an interaction-control policy to better coordinate these behaviors in MLLMs. Evaluations on IPIBench show that representative MLLMs suffer from unstable proactive triggering and weak coordination between reactive and proactive behaviors, and that IPI-Agent consistently improves existing MLLMs across all benchmark settings.
📝 Abstract
Recent multimodal large language models (MLLMs) achieve strong performance on reactive question answering, but real-world streaming assistants require proactive reasoning over continuous visual inputs. Existing benchmarks mainly study reactive or proactive interactions in isolated single-turn settings, overlooking dynamic multi-turn scenarios where users may add, modify, or cancel proactive requests alongside interleaved reactive queries. To address this gap, we introduce IPIBench, the first benchmark for evaluating Interactive Proactive Intelligence of MLLMs under streaming video settings. IPIBench covers proactive monitoring, proactive task management, and interleaved reactive-proactive requests. Evaluations on representative MLLMs reveal two major limitations: unstable proactive triggering and weak coordination between reactive and proactive behaviors. We further propose IPI-Agent, a training-free agentic framework with an interaction-control policy and a temporal-gating mechanism for stabilizing proactive triggering and coordinating multi-turn interactions. Experiments show that IPI-Agent consistently improves existing MLLMs across all benchmark settings.
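To make the interaction pattern described in the abstract concrete, the following is a minimal sketch of a registry in which a user can add, modify, or cancel proactive requests while a stream is processed. It is not the IPI-Agent implementation: the abstract does not specify how temporal gating works, so the cooldown-based gate here is an assumption chosen for illustration, and all names (`ProactiveRequest`, `RequestRegistry`, `cooldown`, `step`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ProactiveRequest:
    """One standing user request, e.g. 'tell me when the water boils'."""
    condition: str
    last_fired: Optional[float] = None  # time of the most recent alert, if any


@dataclass
class RequestRegistry:
    """Hypothetical registry of active proactive requests.

    Assumption: temporal gating is modeled as a per-request cooldown that
    suppresses repeated alerts fired too close together in time.
    """
    cooldown: float = 5.0
    requests: Dict[int, ProactiveRequest] = field(default_factory=dict)
    next_id: int = 0

    def add(self, condition: str) -> int:
        rid = self.next_id
        self.requests[rid] = ProactiveRequest(condition)
        self.next_id += 1
        return rid

    def modify(self, rid: int, condition: str) -> None:
        # A modified request is treated as new, so its gate is reset.
        self.requests[rid] = ProactiveRequest(condition)

    def cancel(self, rid: int) -> None:
        self.requests.pop(rid, None)

    def step(self, t: float, detected: Set[int]) -> List[int]:
        """Return the request ids allowed to fire at time t.

        `detected` holds the ids whose condition some upstream model
        (not shown) judged to be satisfied in the current frame.
        """
        fired = []
        for rid in sorted(detected):
            req = self.requests.get(rid)
            if req is None:
                continue  # request was cancelled; never fire
            if req.last_fired is not None and t - req.last_fired < self.cooldown:
                continue  # gated: fired too recently
            req.last_fired = t
            fired.append(rid)
        return fired
```

In this sketch a cancelled request can no longer trigger, and a request whose condition is detected in consecutive frames fires once per cooldown window rather than on every frame, which is one simple way to read "stabilizing proactive triggering".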