GLIMPSE : Real-Time Text Recognition and Contextual Understanding for VQA in Wearables

📅 2026-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of real-time text-based Visual Question Answering (Text VQA) on wearable devices, where streaming high-resolution video incurs excessive power consumption and thermal throttling, and where maintaining coherent textual context across frames remains difficult. To overcome these limitations, the authors propose a hybrid architecture that exploits the asymmetric resolution requirements of OCR and visual reasoning: high-resolution OCR is applied only to salient regions, while low-resolution video is used to model visual context. By integrating lightweight video processing, selective OCR, and Video Large Language Model inference, the approach achieves 72% accuracy across five Text VQA task categories while reducing power consumption to 49% of that required by full-resolution baselines, substantially extending device battery life.

📝 Abstract
Video Large Language Models (Video LLMs) have shown remarkable progress in understanding and reasoning about visual content, particularly in tasks involving text recognition and text-based visual question answering (Text VQA). However, deploying Text VQA on wearable devices faces a fundamental tension: text recognition requires high-resolution video, but streaming high-quality video drains battery and causes thermal throttling. Moreover, existing models struggle to maintain coherent temporal context when processing text across multiple frames in real-time streams. We observe that text recognition and visual reasoning have asymmetric resolution requirements: OCR needs fine detail while scene understanding tolerates coarse features. We exploit this asymmetry with a hybrid architecture that performs selective high-resolution OCR on-device while streaming low-resolution video for visual context. On a benchmark of text-based VQA samples across five task categories, our system achieves 72% accuracy at 0.49x the power consumption of full-resolution streaming, enabling sustained VQA sessions on resource-constrained wearables without sacrificing text understanding quality.
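The resolution-asymmetry idea in the abstract can be sketched per frame as follows. This is a minimal illustration, not the paper's implementation: the saliency heuristic (`select_salient_regions`, a simple local-contrast threshold over a grid) and the average-pooling context path are hypothetical stand-ins for the system's actual text-region detector and low-resolution video stream.

```python
import numpy as np

def select_salient_regions(frame, threshold=40.0, grid=8):
    """Hypothetical saliency stand-in: flag grid cells with high local
    contrast, where rendered text is likely to appear."""
    h, w = frame.shape
    ch, cw = h // grid, w // grid
    regions = []
    for i in range(grid):
        for j in range(grid):
            cell = frame[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            if cell.std() > threshold:
                regions.append((i * ch, j * cw, ch, cw))
    return regions

def hybrid_frame_pipeline(frame, context_scale=4):
    """Split one grayscale frame into (a) full-detail crops kept for
    on-device OCR and (b) a downscaled copy streamed for visual context,
    exploiting the OCR/scene-understanding resolution asymmetry."""
    crops = [frame[y:y + h, x:x + w]
             for (y, x, h, w) in select_salient_regions(frame)]
    # Coarse context: average-pool by context_scale; scene reasoning
    # tolerates this loss of detail, per the paper's observation.
    H, W = frame.shape
    ctx = frame[:H - H % context_scale, :W - W % context_scale]
    ctx = ctx.reshape(H // context_scale, context_scale,
                      W // context_scale, context_scale).mean(axis=(1, 3))
    return crops, ctx
```

Only the small salient crops retain full resolution; everything else is transmitted at `1/context_scale` resolution, which is where the power savings in the reported 0.49x figure would come from.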
Problem

Research questions and friction points this paper is trying to address.

Text VQA
wearable devices
real-time text recognition
temporal context
power consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid architecture
resolution asymmetry
on-device OCR
low-power VQA
wearable computing