🤖 AI Summary
This work addresses the challenges of real-time Text Visual Question Answering (Text VQA) on wearable devices, where high-resolution video streams incur excessive power consumption and thermal throttling, and where maintaining coherent textual context across frames is difficult. To overcome these limitations, the authors propose a hybrid architecture that exploits the asymmetric resolution requirements of OCR and visual reasoning: high-resolution OCR is applied only to salient regions, while low-resolution video is used to model visual context. Integrating lightweight video processing, selective OCR, and video large language model inference, the approach achieves 72% accuracy across five Text VQA task categories while reducing power consumption to 49% of that required by full-resolution baselines, substantially extending device battery life.
📝 Abstract
Video Large Language Models (Video LLMs) have shown remarkable progress in understanding and reasoning about visual content, particularly in tasks involving text recognition and text-based visual question answering (Text VQA). However, deploying Text VQA on wearable devices faces a fundamental tension: text recognition requires high-resolution video, but streaming high-quality video drains battery and causes thermal throttling. Moreover, existing models struggle to maintain coherent temporal context when processing text across multiple frames in real-time streams. We observe that text recognition and visual reasoning have asymmetric resolution requirements: OCR needs fine detail, while scene understanding tolerates coarse features. We exploit this asymmetry with a hybrid architecture that performs selective high-resolution OCR on-device while streaming low-resolution video for visual context. On a benchmark of text-based VQA samples across five task categories, our system achieves 72% accuracy at 0.49x the power consumption of full-resolution streaming, enabling sustained VQA sessions on resource-constrained wearables without sacrificing text understanding quality.
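The core idea of the hybrid architecture, splitting each frame into a cheap low-resolution context stream plus high-resolution crops of text-salient regions, can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names, the downsampling factor, and the assumption that salient regions arrive as `(x, y, w, h)` boxes from some upstream text detector are all hypothetical.

```python
import numpy as np

def downsample(frame: np.ndarray, factor: int) -> np.ndarray:
    """Low-resolution context stream via strided subsampling (illustrative;
    a real system would use a proper resize with filtering)."""
    return frame[::factor, ::factor]

def crop_regions(frame: np.ndarray, regions: list[tuple[int, int, int, int]]) -> list[np.ndarray]:
    """Full-resolution crops of text-salient regions, kept on-device for OCR."""
    return [frame[y:y + h, x:x + w] for (x, y, w, h) in regions]

def split_streams(frame, regions, factor=4):
    """Return (coarse context frame, high-res OCR crops) for one input frame."""
    return downsample(frame, factor), crop_regions(frame, regions)

# Toy 480x640 RGB frame with one hypothetical detected text box.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
context, crops = split_streams(frame, [(100, 200, 120, 40)])
print(context.shape)   # (120, 160, 3) — streamed for visual reasoning
print(crops[0].shape)  # (40, 120, 3)  — OCR'd at full resolution
```

The point of the split is that only the small crops retain fine detail, so the bandwidth- and power-hungry full-resolution path touches a fraction of each frame's pixels.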