🤖 AI Summary
Conventional RGB-based OCR systems on smart glasses suffer from motion blur and limited dynamic range in low-light, high-speed motion scenarios, and their dense frame capture imposes excessive bandwidth and power costs, severely degrading text recognition performance.
Method: This paper proposes the first foveated OCR framework for smart glasses leveraging eye-tracking–guided event streams. It employs an event camera to capture sparse, asynchronous visual signals and dynamically foveates on the gaze-centered region using real-time eye-tracking data to suppress redundancy. A deep binary neural network enables efficient event-to-frame reconstruction, trained on synthetically generated data to enhance robustness in low-light and motion-blurred scenarios. Additionally, a multimodal large language model (MLLM) is integrated to improve semantic text understanding.
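As a point of reference for the reconstruction step, the sketch below shows a naive, non-learned event-to-frame accumulation that simply marks pixels where events fired. The paper replaces this with a learned deep binary reconstruction network; the event layout and function names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def events_to_binary_frame(events: np.ndarray, height: int, width: int) -> np.ndarray:
    """Naive baseline: events is an (N, 4) array of [x, y, timestamp, polarity];
    returns a binary HxW frame marking pixels that fired at least one event."""
    frame = np.zeros((height, width), dtype=np.uint8)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    frame[ys, xs] = 1  # set any pixel that received an event in the window
    return frame

# Toy stream: two events at pixel (x=2, y=1) and one at (x=0, y=0).
ev = np.array([[2, 1, 0.0, 1], [2, 1, 0.1, -1], [0, 0, 0.2, 1]])
frame = events_to_binary_frame(ev, 3, 4)
print(int(frame.sum()))  # 2 distinct pixels fired
```

A learned reconstruction consumes the same sparse input but produces an intensity-like image suitable for OCR, which is where the synthetic training data comes in.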
Results: Compared to a wearable RGB frame camera, the system reduces bandwidth by around 98% (up to 2400 times less data), lowering power consumption and extending battery life while enabling real-time operation, thereby overcoming fundamental OCR limitations of traditional vision systems under extreme conditions.
📝 Abstract
Current smart glasses equipped with RGB cameras struggle to perceive the environment in low-light and high-speed motion scenarios due to motion blur and the limited dynamic range of frame cameras. Additionally, capturing dense images with a frame camera requires large bandwidth and power consumption, consequently draining the battery faster. These challenges are especially relevant for developing algorithms that can read text from images. In this work, we propose a novel event-based Optical Character Recognition (OCR) approach for smart glasses. By using the eye gaze of the user, we foveate the event stream to reduce bandwidth by around 98% while exploiting the benefits of event cameras in high-dynamic-range and fast scenes. Our proposed method performs deep binary reconstruction trained on synthetic data and leverages multimodal LLMs for OCR, outperforming traditional OCR solutions. Our results demonstrate the ability to read text in low-light environments where RGB cameras struggle, while using up to 2400 times less bandwidth than a wearable RGB camera.
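The gaze-driven foveation described above amounts to keeping only events near the current gaze point. The following is a minimal sketch of that filtering step, assuming an (x, y, timestamp, polarity) event layout and a square foveal window; the window size and function names are illustrative, not taken from the paper.

```python
import numpy as np

def foveate_events(events: np.ndarray, gaze_xy: tuple,
                   half_window: int = 64) -> np.ndarray:
    """Keep only events inside a square window centered on the gaze point.

    events: (N, 4) array of [x, y, timestamp, polarity].
    gaze_xy: (x, y) gaze coordinates from the eye tracker.
    """
    gx, gy = gaze_xy
    x, y = events[:, 0], events[:, 1]
    keep = (np.abs(x - gx) <= half_window) & (np.abs(y - gy) <= half_window)
    return events[keep]

# Toy stream: gaze at (100, 100) with a 64-pixel half window.
stream = np.array([
    [100, 100, 0.001, 1],   # at gaze center  -> kept
    [160, 100, 0.002, -1],  # 60 px away      -> kept
    [300, 300, 0.003, 1],   # far peripheral  -> dropped
])
fov = foveate_events(stream, (100, 100))
print(len(fov))  # 2 of 3 events survive foveation
```

Because event cameras only emit data where intensity changes, discarding peripheral events in this way is what drives the large bandwidth reduction reported above.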