🤖 AI Summary
This work addresses the challenge of evaluating the visual question answering (VQA) capabilities of AI assistants on wearable devices, such as smart glasses, under realistic first-person-view conditions. To this end, we introduce WearVQA, the first dedicated benchmark for wearable VQA. It systematically models the dual challenges of real-world wearable scenarios: visual quality degradation (e.g., occlusion, motion blur, low illumination) and semantic understanding, spanning seven image domains, ten cognitive tasks, and six common imaging defects. Methodologically, WearVQA integrates human-annotated image-question-answer triplets with an LLM-as-a-judge automated evaluation framework, enabling fine-grained assessment of both recognition accuracy and multi-step reasoning capability. Experiments reveal that state-of-the-art open-source and commercial multimodal large language models achieve only 24%-52% accuracy on WearVQA, with pronounced performance degradation on low-quality images and complex reasoning tasks, highlighting critical robustness bottlenecks in practical deployment. The benchmark is publicly released to advance trustworthy evaluation and development of wearable multimodal AI.
📝 Abstract
We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering (VQA) capabilities of multimodal AI assistants on wearable devices like smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multimodal LLMs achieve QA accuracies of only 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multimodal wearable AI systems.
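Below is a minimal sketch of how an LLM-as-a-judge QA scorer of this kind could be wired up. It is not the released WearVQA evaluation code: the `Sample` type, the `JUDGE_PROMPT` template, and the `judge_fn` callable (a stand-in for whatever LLM API serves as the judge) are illustrative assumptions.

```python
# Illustrative sketch of an LLM-as-a-judge VQA scorer (not the authors' code):
# for each question-answer pair, a judge model compares the candidate answer
# against the human-annotated reference and returns a binary verdict, which is
# aggregated into overall QA accuracy.

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Sample:
    question: str
    reference_answer: str  # human-annotated ground truth
    model_answer: str      # answer produced by the VQA model under test


# Hypothetical judging prompt; the real benchmark's prompt may differ.
JUDGE_PROMPT = (
    "You are grading a visual question answering system.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with exactly CORRECT or INCORRECT."
)


def qa_accuracy(samples: Iterable[Sample], judge_fn: Callable[[str], str]) -> float:
    """Score each sample with the judge LLM and return overall QA accuracy."""
    verdicts = []
    for s in samples:
        prompt = JUDGE_PROMPT.format(
            question=s.question,
            reference=s.reference_answer,
            candidate=s.model_answer,
        )
        reply = judge_fn(prompt)  # judge_fn wraps whatever LLM API is used as the judge
        verdicts.append(reply.strip().upper().startswith("CORRECT"))
    return sum(verdicts) / max(len(verdicts), 1)
```

Keeping the judge behind a plain callable leaves the sketch independent of any particular LLM provider, and the binary CORRECT/INCORRECT verdict mirrors the accuracy-style scoring described in the abstract.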