🤖 AI Summary
Vision-language models (VLMs) perform poorly on fine-grained tasks such as component state detection in AR-assisted assembly training, with the current state-of-the-art model (GPT-4o) reaching only a 40.54% F1 score. Method: We introduce the first systematic, fine-grained vision-language dataset tailored for AR training, featuring multi-stage assembly state annotations and task-reasoning samples; we design a multi-granularity benchmark covering state detection and step reasoning, and conduct a unified evaluation of nine leading VLMs. Contribution/Results: Our analysis reveals fundamental limitations in cross-modal fine-grained alignment, and we propose a technical pathway toward coordinated pixel-level and semantic-level understanding. All resources, including the dataset, benchmark, and evaluation code, are fully open-sourced. We also explicitly incorporate accessibility considerations for visually impaired users, advancing equitable and precise multimodal intelligent assistance for AR-based learning.
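To make the reported metric concrete, here is a minimal sketch of how micro-averaged F1 over component-state detections could be scored. This is not the authors' released evaluation code; the label format (per-image sets of `(component, state)` pairs) and the `micro_f1` function name are illustrative assumptions.

```python
# Sketch: micro-averaged F1 for component-state detection.
# Assumes each benchmark image yields a set of (component, state) labels;
# this format is a hypothetical stand-in, not the dataset's actual schema.

def micro_f1(predictions, references):
    """predictions / references: lists of sets of (component, state) tuples,
    one set per image. Returns micro-averaged F1 in [0, 1]."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        tp += len(pred & ref)   # correctly detected component states
        fp += len(pred - ref)   # spurious or wrong states
        fn += len(ref - pred)   # missed states
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example: two images; the model gets one state right and one wrong.
preds = [{("bracket", "attached")}, {("screw", "loose")}]
refs  = [{("bracket", "attached")}, {("screw", "tight")}]
print(f"micro-F1 = {micro_f1(preds, refs):.2%}")  # prints: micro-F1 = 50.00%
```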
📝 Abstract
Vision-language models (VLMs) are essential for enabling AI-powered smart assistants to interpret and reason in multimodal environments. However, their application in augmented reality (AR) training remains largely unexplored. In this work, we introduce a comprehensive dataset tailored for AR training, featuring systematized vision-language tasks, and evaluate nine state-of-the-art VLMs on it. Our results reveal that even advanced models, including GPT-4o, struggle with fine-grained assembly tasks, achieving a maximum F1 score of just 40.54% on state detection. These findings highlight the need for better datasets and benchmarks, and for further research to improve fine-grained vision-language alignment. Beyond its technical contributions, our work has broader social implications, particularly in giving blind and visually impaired users equitable access to AI-driven learning opportunities. We provide all related resources, including the dataset, source code, and evaluation results, to support the research community.