PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

📅 2026-03-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current video understanding benchmarks struggle to evaluate models' capacity for long-horizon, multi-step, perception-intensive compositional reasoning. To address this gap, this work introduces a manually annotated video benchmark designed specifically for complex perception-centric reasoning, with long-horizon tasks that require multiple perceptual interactions and composite logical structures (conjunctions and sequential dependencies). The benchmark spans diverse domains, including urban walks, indoor tours, and gameplay footage, and uses a five-option question-answering format. Evaluation combines human cognitive behavior studies with assessments of state-of-the-art multimodal large language models. The results reveal a significant performance gap: human accuracy drops to 18.97% without video replay, the top-performing model, Gemini-3-Flash, achieves only 45.96%, and all open-source models score below 40%, highlighting critical limitations in how current models integrate visual evidence across time and perform deep compositional reasoning.
📝 Abstract
We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence combined under conjunctive and sequential constraints. Questions span perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and demand skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains, including city walk tours, indoor villa tours, video games, and extreme outdoor sports, and is 100% manually annotated. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.
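
To make the five-choice evaluation setup concrete, the sketch below shows one way a PerceptionComp-style item could be represented and scored. This is a minimal illustration under assumptions, not the authors' code: the field names, item keys, and letter-based answer format are hypothetical.

    # Minimal sketch (not the authors' code) of a five-choice benchmark item
    # and its accuracy metric. Field names and the item key are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Item:
        video_id: str        # one of the 279 source videos
        question: str        # long-horizon, perception-centric question
        options: list[str]   # five answer choices (A-E)
        answer: str          # ground-truth letter, e.g. "C"

    def accuracy(items: list[Item], predictions: dict[str, str]) -> float:
        """Fraction of items whose predicted letter matches the ground truth.

        With five options, random guessing yields about 20% accuracy, which is
        why 18.97% human accuracy without rewatching is described as near chance.
        """
        if not items:
            return 0.0
        correct = sum(
            predictions.get(f"{it.video_id}|{it.question}") == it.answer
            for it in items
        )
        return correct / len(items)
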
Problem

Research questions and friction points this paper is trying to address.

perception-centric reasoning
long-horizon video understanding
compositional visual reasoning
temporal reasoning
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

perception-centric reasoning
long-horizon video understanding
compositional visual reasoning
temporal-spatial reasoning
manually annotated benchmark