CaughtCheating: Is Your MLLM a Good Cheating Detective? Exploring the Boundary of Visual Perception and Reasoning

📅 2025-06-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the fundamental limits of multimodal large language models (MLLMs) in human-like detective-style visual reasoning—specifically, detecting subtle cheating cues in images and constructing coherent, causally grounded scene explanations. Method: We introduce CaughtCheating, the first benchmark explicitly designed for realistic social scenarios, emphasizing fine-grained visual perception and causal, contextual inference; it comprises a high-quality, expert-annotated dataset and a systematic evaluation framework leveraging controlled prompting and ablation studies. Contribution/Results: Comprehensive evaluation of state-of-the-art MLLMs—including GPT-4o—reveals near-zero accuracy, exposing critical deficiencies in cross-modal cue integration, implicit relational modeling, and counterfactual reasoning. Our benchmark establishes a rigorous new evaluation standard and identifies concrete, actionable directions for advancing deep visual understanding in MLLMs.
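For readers who want to probe this failure mode themselves, the sketch below shows one way a controlled-prompting evaluation loop could be wired up. It is not the paper's released harness: the dataset file, field names (`image_url`, `clue_keywords`), prompt wording, and keyword-match scoring are illustrative assumptions, and an OpenAI-compatible chat API simply stands in for whichever MLLM is under test.

```python
# Hypothetical sketch of a controlled-prompting evaluation loop for a
# CaughtCheating-style item. Dataset fields, prompt wording, and the
# scoring rule are assumptions, not the paper's actual protocol.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are shown a photo shared on social media. "
    "List any subtle visual clues suggesting the person who took or appears "
    "in the photo may be hiding something, and explain your reasoning."
)

def evaluate_item(item: dict, model: str = "gpt-4o") -> bool:
    """Query the model on one image and check whether any annotated clue is mentioned."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": item["image_url"]}},
            ],
        }],
    )
    answer = response.choices[0].message.content.lower()
    # Naive string-match scoring against expert-annotated clue keywords (assumed field).
    return any(clue.lower() in answer for clue in item["clue_keywords"])

if __name__ == "__main__":
    with open("caughtcheating_items.json") as f:  # assumed local dataset dump
        items = json.load(f)
    hits = sum(evaluate_item(it) for it in items)
    print(f"clue detection rate: {hits}/{len(items)}")
```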

📝 Abstract
Recent agentic Multi-Modal Large Language Models (MLLMs) such as GPT-o3 have achieved near-ceiling scores on various existing benchmarks, motivating a demand for more challenging test tasks. These MLLMs have been reported to excel in a few tasks that are expert-level for humans, e.g., GeoGuesser, reflecting their potential as detectives who can notice minuscule cues in an image and weave them into coherent, situational explanations that lead to a reliable answer. But can they match the performance of excellent human detectives? To answer this question, we investigate several hard scenarios that GPT-o3 can still handle and identify a common scenario in which o3's performance drops to nearly zero, which we name CaughtCheating. It is inspired by social media posts in which users ask others to spot suspicious clues in photos shared by their partners. We conduct extensive experiments and analyses to understand why existing MLLMs lack the capability to solve this kind of task. CaughtCheating provides a class of challenging visual perception and reasoning tasks with high practical value. Success on these tasks paves the way for MLLMs to acquire human-level detective perception and reasoning capabilities.
Problem

Research questions and friction points this paper is trying to address.

Assessing MLLMs' ability to detect subtle cheating clues in images
Exploring MLLMs' limitations in visual perception and reasoning tasks
Developing challenging benchmarks for human-level detective capabilities in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

CaughtCheating, a benchmark of realistic scenarios requiring detection of minuscule visual cues
Challenging scenarios that probe the limits of MLLM visual reasoning
An evaluation target for human-level detective perception and reasoning in MLLMs
👥 Authors
Ming Li
University of Maryland, College Park
Chenguang Wang
University of Maryland, College Park
Yijun Liang
yliang17@umd.edu
Xiyao Wang
Ph.D., University of Maryland, College Park
World Model, Embodied AI, Multimodal LLM
Yuhang Zhou
University of Maryland, College Park
Xiyang Wu
University of Maryland
Reinforcement Learning, Robotics, Large Language Model, Vision Language Model
Yuqing Zhang
University of Groningen
computational linguistics, speech processing
Ruiyi Zhang
University of Maryland, College Park
Tianyi Zhou
University of Maryland, College Park