AI Summary
Existing vision-language models exhibit significant limitations in multi-hop reasoning within multi-player game scenarios involving incomplete and deceptive information. This work proposes a collaborative multi-agent framework that introduces a role-customized murder-mystery script generation mechanism and a two-stage agent supervision training paradigm, integrating chain-of-thought fine-tuning, GRPO reinforcement learning, and multimodal contextual modeling. The approach establishes the first scalable training and evaluation paradigm for vision-language models specifically designed for environments with deceptive and incomplete information. Experimental results demonstrate substantial improvements in narrative reasoning, extraction of hidden facts, and robustness against deception, highlighting the framework's effectiveness in enhancing complex reasoning capabilities under uncertainty.
Abstract
Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet their performance degrades on complex multi-hop reasoning in multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which requires inferring hidden truths from partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi-agent framework for evaluating and synthesizing high-quality, role-driven multiplayer game scripts, enabling fine-grained interaction patterns tailored to character identities (i.e., murderer vs. innocent). Through coordinated agent interactions, our system generates rich multimodal contexts, including character backstories, visual and textual clues, and multi-hop reasoning chains. We design a two-stage agent-monitored training strategy to enhance the reasoning ability of VLMs: (1) chain-of-thought-based fine-tuning on curated and synthetic datasets that model uncertainty and deception; (2) GRPO-based reinforcement learning with agent-monitored reward shaping, which encourages the model to develop character-specific reasoning behaviors and effective multimodal multi-hop inference. Extensive experiments demonstrate that our method significantly boosts the performance of VLMs in narrative reasoning, hidden fact extraction, and deception-resilient understanding. Our contributions offer a scalable solution for training and evaluating VLMs under uncertain, adversarial, and socially complex conditions, laying the groundwork for future benchmarks in multimodal multi-hop reasoning under imperfect information.
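To make the second training stage more concrete, the following is a minimal sketch of the group-relative advantage computation that GRPO is generally built on: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its group. The abstract does not specify the reward design, so `shaped_reward`, its `monitor_score` input, and the `weight` parameter are hypothetical stand-ins for the agent-monitored reward shaping, and the use of the population standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def shaped_reward(answer_correct, monitor_score, weight=0.5):
    """Hypothetical shaping: task reward plus a weighted monitor-agent score in [0, 1]."""
    return float(answer_correct) + weight * monitor_score


# Four sampled responses to the same prompt (illustrative values only).
rewards = [
    shaped_reward(True, 0.8),
    shaped_reward(False, 0.2),
    shaped_reward(True, 0.4),
    shaped_reward(False, 0.6),
]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average receive positive advantages and those below receive negative ones, so the policy update depends on relative quality within the group rather than on a separately learned value function.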