PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

📅 2026-03-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current video understanding benchmarks struggle to evaluate models' capacity for long-horizon, multi-step, perception-intensive compositional reasoning. To address this gap, this work introduces a manually annotated video benchmark designed specifically for complex perception-centric reasoning, with long-horizon tasks that require multiple perceptual interactions and composite logical structures (conjunctions and sequential dependencies). The benchmark spans diverse domains, including urban walks, indoor tours, and gameplay footage, and uses a five-option question-answering format. Evaluation combines human cognitive behavior studies with assessments of state-of-the-art multimodal large language models. The results reveal a significant performance gap: human accuracy drops to 18.97% without video replay, the top-performing model, Gemini-3-Flash, achieves only 45.96%, and all open-source models score below 40%, highlighting critical limitations in how current models integrate visual evidence across time and perform deep compositional reasoning.
📝 Abstract
We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence combined under conjunctive and sequential constraints. Questions span perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and demand skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains, including city walk tours, indoor villa tours, video games, and extreme outdoor sports, and is 100% manually annotated. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.
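
To make the five-choice evaluation setup concrete, the sketch below shows one way a PerceptionComp-style item could be represented and scored. This is a minimal illustration under assumptions, not the authors' code: the field names, item keys, and letter-based answer format are hypothetical.

    # Minimal sketch (not the authors' code) of a five-choice benchmark item
    # and its accuracy metric. Field names and the item key are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Item:
        video_id: str        # one of the 279 source videos
        question: str        # long-horizon, perception-centric question
        options: list[str]   # five answer choices (A-E)
        answer: str          # ground-truth letter, e.g. "C"

    def accuracy(items: list[Item], predictions: dict[str, str]) -> float:
        """Fraction of items whose predicted letter matches the ground truth.

        With five options, random guessing yields about 20% accuracy, which is
        why 18.97% human accuracy without rewatching is described as near chance.
        """
        if not items:
            return 0.0
        correct = sum(
            predictions.get(f"{it.video_id}|{it.question}") == it.answer
            for it in items
        )
        return correct / len(items)
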
Problem

Research questions and friction points this paper is trying to address.

perception-centric reasoning
long-horizon video understanding
compositional visual reasoning
temporal reasoning
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

perception-centric reasoning
long-horizon video understanding
compositional visual reasoning
temporal-spatial reasoning
manually annotated benchmark