WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

📅 2025-11-27
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the challenge of evaluating the visual question answering (VQA) capabilities of wearable devices, such as smart glasses, under realistic first-person-view conditions. To this end, we introduce WearVQA, the first dedicated benchmark for wearable VQA. It systematically models the dual challenges of real-world wearable scenarios: visual quality degradation (e.g., occlusion, motion blur, low illumination) and semantic understanding, spanning seven image domains, ten cognitive tasks, and six common imaging defects. Methodologically, WearVQA pairs human-annotated image–question–answer triplets with an LLM-as-a-judge automated evaluation framework, enabling fine-grained assessment of both recognition accuracy and multi-step reasoning. Experiments reveal that state-of-the-art open-source and commercial multimodal large language models achieve only 24%–52% accuracy on WearVQA, with pronounced performance degradation on low-quality images and complex reasoning tasks, highlighting critical robustness bottlenecks for practical deployment. The benchmark is publicly released to advance trustworthy evaluation and development of wearable multimodal AI.
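As a rough sketch of the pipeline the summary describes (annotated triplets, a model under test, an LLM judge), the snippet below wires those pieces together. All field and function names here are assumptions for illustration, not the benchmark's released interface:

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    """One benchmark item; field names are illustrative assumptions."""
    image_path: str   # egocentric photo, possibly degraded
    question: str     # grounded in a wearable use case
    answer: str       # human-annotated reference answer

def evaluate(triplets, answer_fn, judge_fn):
    """Overall accuracy: query the model, then let an LLM judge grade it.

    answer_fn(image_path, question) -> model's free-form answer
    judge_fn(question, reference, prediction) -> bool (is it correct?)
    """
    correct = sum(
        judge_fn(t.question, t.answer, answer_fn(t.image_path, t.question))
        for t in triplets
    )
    return correct / len(triplets) if triplets else 0.0
```

Keeping the judge behind a callable makes the grader easy to swap or audit, which matters given the 96% labeling accuracy the abstract reports for the judging framework.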

📝 Abstract
We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering (VQA) capabilities of multi-modal AI assistants on wearable devices like smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of ego-centric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multi-modal LLMs achieve only 24-52% QA accuracy on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multi-modal wearable AI systems.
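The abstract organizes the triplets along three axes (7 image domains, 10 task types, 6 quality issues), which suggests reporting accuracy per category as well as overall. A minimal sketch of such a breakdown, using hypothetical category names and a made-up judged-results format rather than the released schema:

```python
from collections import defaultdict

def accuracy_by(results, axis):
    """Per-category accuracy along one axis (domain, task, or quality).

    results: dicts with a boolean 'correct' plus metadata fields;
    this schema is an assumption, not WearVQA's released format.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r[axis]] += 1
        hits[r[axis]] += int(r["correct"])
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Hypothetical judged results, for illustration only.
results = [
    {"domain": "text-centric", "task": "recognition", "quality": "blur", "correct": True},
    {"domain": "general", "task": "reasoning", "quality": "low_light", "correct": False},
]
for axis in ("domain", "task", "quality"):
    print(axis, accuracy_by(results, axis))
```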
Problem

Research questions and friction points this paper is trying to address.

Evaluates VQA for wearables in real-world ego-centric scenarios
Addresses challenges like occlusion, poor lighting, and blurry images (see the sketch after this list)
Tests AI on diverse tasks from recognition to complex reasoning
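For intuition about those failure modes, here is a toy Pillow snippet that synthesizes stand-ins for three of them (blur, low illumination, occlusion). It is illustrative only and says nothing about how the benchmark's real egocentric images were collected:

```python
from PIL import Image, ImageDraw, ImageEnhance, ImageFilter

def degrade(img: Image.Image) -> dict[str, Image.Image]:
    """Synthetic stand-ins for wearable capture defects (toy example)."""
    blurred = img.filter(ImageFilter.GaussianBlur(radius=4))  # motion-blur proxy
    dark = ImageEnhance.Brightness(img).enhance(0.3)          # low illumination
    occluded = img.copy()                                     # e.g., a hand over the lens
    w, h = occluded.size
    ImageDraw.Draw(occluded).rectangle([0, 0, w // 2, h], fill="black")
    return {"blur": blurred, "low_light": dark, "occlusion": occluded}

# Usage: variants = degrade(Image.open("frame.jpg").convert("RGB"))
```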
Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark for VQA on wearable devices
Evaluates AI on occluded, blurry, ego-centric images
Uses an LLM-as-a-judge framework with 96% labeling accuracy (see the sketch after this list)
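A minimal sketch of what such a judge call could look like; the prompt wording and the `call_llm` client are placeholders, not the paper's released framework:

```python
# Hypothetical grading prompt; WearVQA's actual judge prompt is not shown here.
JUDGE_PROMPT = """You are grading a visual question answering system.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Reply with exactly CORRECT or INCORRECT."""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real LLM client here.
    raise NotImplementedError

def judge_is_correct(question: str, reference: str, prediction: str) -> bool:
    """True if the judge LLM deems the model answer correct."""
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, prediction=prediction))
    return reply.strip().upper().startswith("CORRECT")
```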
👥 Authors
Eun Chang (Meta Reality Labs)
Zhuangqun Huang (Meta Reality Labs)
Yiwei Liao (Meta Reality Labs)
Sagar Ravi Bhavsar (Meta Reality Labs)
Amogh Param (Meta Reality Labs)
Tammy Stark (Meta Reality Labs)
Adel Ahmadyan (Meta Reality Labs)
Xiao Yang (Meta Reality Labs)
Jiaqi Wang (Meta Reality Labs)
Ahsan Abdullah (Meta Reality Labs)
Giang Nguyen (Meta Reality Labs)
Akil Iyer (Meta Reality Labs)
David Hall (Research Scientist, CSIRO)
Elissa Li (Meta)
Shane Moon (Meta Reality Labs)
Nicolas Scheffer (Meta Reality Labs)
Kirmani Ahmed (Meta Reality Labs)
Babak Damavandi (Meta Reality Labs)
Rakesh Wanga (Meta Reality Labs)
Anuj Kumar (Meta Reality Labs)
Rohit Patel (Meta)
Xin Luna Dong (ACM / IEEE Fellow, Principal Scientist at Meta)