🤖 AI Summary
This work addresses a limitation of existing intent recognition datasets: most are confined to single-turn or simple dialogues, so they cannot capture the complex intentions found in real-world scenarios with prolonged, multi-turn interactions and strategic deception. To bridge this gap, we introduce MISID, a dataset derived from high-stakes social strategy games that features multimodal, multi-turn, and multi-participant interactions. MISID comes with a fine-grained, two-tier multi-dimensional annotation scheme designed for strategic deception. Building on this foundation, we propose FRACTAM, a framework that follows a “Decouple-Anchor-Reason” paradigm: it uses explicit cross-modal evidence chains and long-context discourse analysis to mitigate the text bias and weak cross-modal coordination of current multimodal large language models. Experiments show that FRACTAM improves hidden intent detection and reasoning while maintaining robust perceptual accuracy, providing both a benchmark and a baseline method for intent recognition in complex strategic interactions.
📝 Abstract
Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions in which participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios, including text-prior visual hallucination, impaired cross-modal synergy, and limited capacity for chaining causal cues. To address these deficiencies, we propose FRACTAM as a baseline framework. Using a "Decouple-Anchor-Reason" paradigm, FRACTAM reduces text bias by extracting pure unimodal factual representations, employs two-stage retrieval for long-range factual anchoring, and constructs explicit cross-modal evidence chains. Extensive experiments demonstrate that FRACTAM enhances mainstream models' performance in complex strategic tasks, improving hidden intent detection and inference while maintaining robust perceptual accuracy. Our dataset is available at https://naislab.cn/datasets/MISID.
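To make the "Anchor" and "Reason" steps more concrete, the following is a minimal sketch of a two-stage retrieval over unimodal facts followed by a simple evidence chain. It is not FRACTAM's implementation: the `Fact` record, the word-overlap and Jaccard scores, the speaker bonus of 0.5, and the game-style example facts are all assumptions made for illustration, and the abstract does not specify how either retrieval stage or the chain is actually built.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """Hypothetical record for one decoupled, unimodal observation."""
    turn: int        # index of the dialogue turn the fact came from
    speaker: str     # participant the observation is about
    modality: str    # "text", "audio", or "video"
    content: str     # short factual description of the observation


def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a stand-in for a real encoder."""
    return set(text.lower().split())


def coarse_retrieve(query: str, facts: list[Fact], k: int) -> list[Fact]:
    """Stage 1 (assumed): keep the k facts sharing the most words with the query."""
    q = tokens(query)
    ranked = sorted(facts, key=lambda f: len(q & tokens(f.content)), reverse=True)
    return ranked[:k]


def rerank(query: str, speaker: str, candidates: list[Fact], k: int) -> list[Fact]:
    """Stage 2 (assumed): rerank by Jaccard similarity plus a target-speaker bonus."""
    q = tokens(query)

    def score(f: Fact) -> float:
        t = tokens(f.content)
        union = q | t
        jaccard = len(q & t) / len(union) if union else 0.0
        return jaccard + (0.5 if f.speaker == speaker else 0.0)

    return sorted(candidates, key=score, reverse=True)[:k]


def evidence_chain(facts: list[Fact]) -> str:
    """Order the anchored facts by turn and tag each with modality and speaker."""
    ordered = sorted(facts, key=lambda f: f.turn)
    return " -> ".join(
        f"[turn {f.turn} | {f.modality} | {f.speaker}] {f.content}" for f in ordered
    )


# Invented example facts from a hypothetical social deduction game.
facts = [
    Fact(3, "P2", "text", "P2 claims to be a villager"),
    Fact(7, "P2", "video", "P2 avoids eye contact when accused"),
    Fact(12, "P4", "text", "P4 votes against P1"),
    Fact(15, "P2", "audio", "P2 voice pitch rises when accused"),
]

candidates = coarse_retrieve("P2 accused", facts, k=3)
anchored = rerank("P2 accused", "P2", candidates, k=2)
chain = evidence_chain(anchored)
print(chain)
```

In this sketch the chain keeps modality tags next to each fact so that a downstream reasoning prompt can cite visual and acoustic cues explicitly instead of relying on the transcript alone, which is the kind of text bias the abstract describes.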