Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) perform poorly on fine-grained tasks such as component state detection in AR-assisted assembly training; the strongest model evaluated, GPT-4o, reaches only a 40.54% F1 score. Method: We introduce the first systematic, fine-grained vision-language dataset tailored for AR training, featuring multi-stage assembly-state annotations and task-reasoning samples; we design a multi-granularity benchmark covering state detection and step reasoning, and run a unified evaluation across nine leading VLMs. Contribution/Results: Our analysis reveals fundamental limitations in fine-grained cross-modal alignment, and we outline a technical pathway toward coordinated pixel-level and semantic-level understanding. All resources, including the dataset, benchmark, and evaluation code, are open-sourced. We also explicitly incorporate accessibility considerations for blind and visually impaired users, advancing equitable and precise multimodal assistance for AR-based learning.

📝 Abstract
Vision-language models (VLMs) are essential for enabling AI-powered smart assistants to interpret and reason in multimodal environments. However, their application in augmented reality (AR) training remains largely unexplored. In this work, we introduce a comprehensive dataset tailored for AR training, featuring systematized vision-language tasks, and evaluate nine state-of-the-art VLMs on it. Our results reveal that even advanced models, including GPT-4o, struggle with fine-grained assembly tasks, achieving a maximum F1 score of just 40.54% on state detection. These findings highlight the need for enhanced datasets, benchmarks, and further research to improve fine-grained vision-language alignment. Beyond technical contributions, our work has broader social implications, particularly in empowering blind and visually impaired users with equitable access to AI-driven learning opportunities. We provide all related resources, including the dataset, source code, and evaluation results, to support the research community.
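
For context on the 40.54% figure, the snippet below is a minimal sketch of how a macro-averaged F1 score for assembly-state detection could be computed with scikit-learn. The state labels and predictions are invented placeholders, not the paper's benchmark data or released evaluation code.

```python
# Minimal sketch: scoring assembly-state predictions with macro-F1.
# Labels and data below are illustrative placeholders, not taken from
# the paper's dataset or evaluation code.
from sklearn.metrics import f1_score

# Hypothetical fine-grained component states for an assembly step.
STATES = ["not_started", "partially_assembled", "assembled", "misassembled"]

# Ground-truth states and a model's predicted states for a few frames.
y_true = ["assembled", "partially_assembled", "not_started", "misassembled"]
y_pred = ["assembled", "assembled", "not_started", "partially_assembled"]

# Macro averaging weights every state equally, so rare but critical
# states (e.g. "misassembled") are not drowned out by frequent ones.
macro_f1 = f1_score(y_true, y_pred, labels=STATES, average="macro")
print(f"macro F1: {macro_f1:.2%}")
```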
Problem

Research questions and friction points this paper is trying to address.

Enhancing vision-language models for AR training tasks
Addressing poor performance in fine-grained assembly state detection
Improving AI accessibility for blind and visually impaired users
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained vision-language modeling for AR
Comprehensive dataset for AR training tasks
Evaluation of nine state-of-the-art VLMs
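
As an illustration of what a unified evaluation across several VLMs can look like, the sketch below hides each model behind a common (image_path, prompt) -> label callable and scores every backend with the same macro-F1. The constant_baseline backend, the prompt text, and the sample data are assumptions made for this example; real backends (GPT-4o and the other eight models) would wrap their own APIs, and the paper's released code defines the actual protocol.

```python
# Sketch of a unified evaluation loop over interchangeable VLM backends.
# Everything here is illustrative; the paper's open-sourced harness
# defines the real prompts, data loading, and model wrappers.
from typing import Callable, Dict, List, Tuple
from sklearn.metrics import f1_score

PROMPT = "Which assembly state does the component show? Answer with one label."

# A benchmark sample is (image_path, gold_state); the paths are placeholders.
Sample = Tuple[str, str]


def evaluate(backend: Callable[[str, str], str], samples: List[Sample]) -> float:
    """Query one VLM on every sample and return its macro-F1."""
    gold = [state for _, state in samples]
    pred = [backend(image, PROMPT) for image, _ in samples]
    return f1_score(gold, pred, average="macro")


def constant_baseline(image_path: str, prompt: str) -> str:
    """Dummy backend that always answers the same state; stands in for a real VLM call."""
    return "assembled"


if __name__ == "__main__":
    samples: List[Sample] = [
        ("frames/step3_cam0.png", "partially_assembled"),
        ("frames/step3_cam1.png", "assembled"),
    ]
    # A real run would register nine entries here, one per evaluated VLM.
    backends: Dict[str, Callable[[str, str], str]] = {
        "constant_baseline": constant_baseline,
    }
    for name, backend in backends.items():
        print(f"{name}: macro F1 = {evaluate(backend, samples):.2%}")
```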