🤖 AI Summary
Problem: Current multimodal large language models (MLLMs) perform poorly on scene-text question answering (STQA) in egocentric videos (e.g., driving, household tasks). They are limited in three areas: temporal localization, cross-frame text reasoning, and high-resolution text perception.
Method: We introduce EgoTextVQA, the first benchmark dedicated to egocentric video STQA, comprising 1.5K real-world videos and 7K questions that require both text recognition and spatiotemporal reasoning. We systematically define and evaluate multi-step text understanding under dynamic egocentric viewing. We evaluate 10 state-of-the-art MLLMs (including Gemini 1.5 Pro) in a unified framework that also incorporates auxiliary OCR-derived text inputs and fine-grained temporal annotations.
Results: The best-performing model (Gemini 1.5 Pro) reaches only about 33% accuracy, far below human performance. The experiments point to precise temporal grounding, explicit text-visual fusion, and high-fidelity visual modeling as the critical bottlenecks and the main avenues for improvement.
📝 Abstract
We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real-user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance.