🤖 AI Summary
This work addresses the challenge of evaluating the visual question answering (VQA) capabilities of AI assistants on wearable devices, such as smart glasses, under realistic first-person-view conditions. To this end, we introduce WearVQA, the first dedicated benchmark for wearable VQA. It systematically models the dual challenges of real-world wearable scenarios: visual quality degradation (e.g., occlusion, motion blur, low illumination) and semantic understanding, spanning seven image domains, ten cognitive tasks, and six common imaging defects. Methodologically, WearVQA integrates human-annotated image-question-answer triplets with an LLM-as-a-judge automated evaluation framework, enabling fine-grained assessment of both recognition accuracy and multi-step reasoning capability. Experiments reveal that state-of-the-art open-source and commercial multimodal large language models achieve only 24%-52% accuracy on WearVQA, with pronounced performance degradation on low-quality images and complex reasoning tasks, highlighting critical robustness bottlenecks in practical deployment. The benchmark is publicly released to advance trustworthy evaluation and development of wearable multimodal AI.
📝 Abstract
We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering (VQA) capabilities of multimodal AI assistants on wearable devices like smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multimodal LLMs achieve QA accuracies of only 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multimodal wearable AI systems.
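Below is a minimal sketch of how an LLM-as-a-judge QA scorer of this kind could be wired up. It is not the released WearVQA evaluation code: the `Sample` type, the `JUDGE_PROMPT` template, and the `judge_fn` callable (a stand-in for whatever LLM API serves as the judge) are illustrative assumptions.

```python
# Illustrative sketch of an LLM-as-a-judge VQA scorer (not the authors' code):
# for each question-answer pair, a judge model compares the candidate answer
# against the human-annotated reference and returns a binary verdict, which is
# aggregated into overall QA accuracy.

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Sample:
    question: str
    reference_answer: str  # human-annotated ground truth
    model_answer: str      # answer produced by the VQA model under test


# Hypothetical judging prompt; the real benchmark's prompt may differ.
JUDGE_PROMPT = (
    "You are grading a visual question answering system.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with exactly CORRECT or INCORRECT."
)


def qa_accuracy(samples: Iterable[Sample], judge_fn: Callable[[str], str]) -> float:
    """Score each sample with the judge LLM and return overall QA accuracy."""
    verdicts = []
    for s in samples:
        prompt = JUDGE_PROMPT.format(
            question=s.question,
            reference=s.reference_answer,
            candidate=s.model_answer,
        )
        reply = judge_fn(prompt)  # judge_fn wraps whatever LLM API is used as the judge
        verdicts.append(reply.strip().upper().startswith("CORRECT"))
    return sum(verdicts) / max(len(verdicts), 1)
```

Keeping the judge behind a plain callable leaves the sketch independent of any particular LLM provider, and the binary CORRECT/INCORRECT verdict mirrors the accuracy-style scoring described in the abstract.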