EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Most existing benchmarks for egocentric (first-person) vision understanding offer little or no support for nighttime scenarios. This paper introduces EgoNight, the first egocentric visual question answering (VQA) benchmark designed specifically for nocturnal environments. Methodologically, the authors construct day-night aligned videos that pair real-world recordings with synthetic footage rendered in Blender, and use multimodal large language models (MLLMs) for automated annotation followed by human refinement, yielding a high-quality dataset of 3,658 QA pairs across 90 videos and 12 question types. Two auxiliary tasks, day-night correspondence retrieval and nighttime egocentric depth estimation, are further proposed to probe generalization across illumination domains. Empirically, a systematic evaluation of mainstream MLLMs reveals substantial performance degradation under low-light conditions (average accuracy drops by over 25%), underscoring both the difficulty of the task and the value of the benchmark. Collectively, EgoNight establishes a foundational resource for advancing robust, illumination-invariant egocentric vision understanding.
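The reported day-to-night drop implies a simple evaluation protocol: score the same model on the daytime and nighttime splits and compare accuracies. The sketch below is a minimal illustration of that protocol, not the benchmark's released API; the `answer_fn` callable, the JSON field names, and the split layout are assumptions.

```python
import json
from typing import Callable, Iterable

def accuracy(qa_pairs: Iterable[dict], answer_fn: Callable[[str, str], str]) -> float:
    """Fraction of QA pairs answered correctly, using exact-match scoring (assumed)."""
    correct, total = 0, 0
    for qa in qa_pairs:
        pred = answer_fn(qa["video_path"], qa["question"])  # model under test
        correct += int(pred.strip().lower() == qa["answer"].strip().lower())
        total += 1
    return correct / max(total, 1)

def day_night_gap(day_json: str, night_json: str,
                  answer_fn: Callable[[str, str], str]) -> dict:
    """Score the same model on aligned day and night splits and report the drop."""
    with open(day_json) as f:
        day_split = json.load(f)
    with open(night_json) as f:
        night_split = json.load(f)
    acc_day = accuracy(day_split, answer_fn)
    acc_night = accuracy(night_split, answer_fn)
    return {"day": acc_day, "night": acc_night, "drop": acc_day - acc_night}
```

Because the videos are day-night aligned, the two splits probe the same scenes and actions, so any accuracy difference can be attributed to illumination rather than content.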

📝 Abstract
Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All the data and code will be made available upon acceptance.
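The day-night correspondence retrieval task described in the abstract pairs each nighttime frame with its aligned daytime counterpart. Below is a minimal sketch of how such retrieval could be scored, assuming precomputed frame embeddings from some visual encoder and cosine-similarity matching; the function names and the recall@1 metric are illustrative choices, not the paper's specified protocol.

```python
import numpy as np

def retrieve_day_frames(night_emb: np.ndarray, day_emb: np.ndarray) -> np.ndarray:
    """For each night-frame embedding (N, d), return the index of the most similar
    day frame (out of M, d) under cosine similarity."""
    night = night_emb / np.linalg.norm(night_emb, axis=1, keepdims=True)
    day = day_emb / np.linalg.norm(day_emb, axis=1, keepdims=True)
    sims = night @ day.T               # (N, M) cosine similarities
    return sims.argmax(axis=1)         # best-matching day frame per night frame

def recall_at_1(pred_idx: np.ndarray, gt_idx: np.ndarray) -> float:
    """Fraction of night frames whose top retrieved day frame is the aligned one."""
    return float((pred_idx == gt_idx).mean())
```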
Problem

Research questions and friction points this paper is trying to address.

Addressing the lack of egocentric vision benchmarks for nighttime scenarios
Developing visual question answering for low-light egocentric videos
Quantifying the performance gap of vision models between daytime and nighttime conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Day-night aligned videos improve the quality of nighttime annotations
Day-augmented night auto-labeling engine with extensive human verification (see the labeling-loop sketch after this list)
Blender-rendered synthetic videos and real-world recordings keep scenes and actions visually and temporally aligned
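The day-augmented auto-labeling idea can be pictured as drafting questions on the well-lit day video and then attaching them to the aligned night video for human double-checking. The sketch below is a schematic of that flow under stated assumptions; `generate_qa`, the `QAPair` fields, and the pairing format are hypothetical and do not reflect the authors' released pipeline.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    video_id: str        # the *night* video the question will be asked about
    question: str
    answer: str
    qa_type: str         # one of the benchmark's 12 question types
    verified: bool = False

def day_augmented_labels(day_night_pairs, generate_qa) -> list[QAPair]:
    """Draft QA pairs from the well-lit day video, then transfer them to the aligned
    night video for verification. `generate_qa(video)` is a stand-in for an MLLM
    prompt that returns (question, answer, qa_type) tuples."""
    drafts: list[QAPair] = []
    for day_video, night_video in day_night_pairs:
        for question, answer, qa_type in generate_qa(day_video):
            drafts.append(QAPair(night_video, question, answer, qa_type))
    # Every draft stays unverified until annotators double-check it on the night video.
    return drafts
```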