🤖 AI Summary
Widespread deployment of vision-language models (VLMs) is hindered by user trust deficits and the absence of rigorous, interdisciplinary evaluation frameworks. Method: We conduct a multidisciplinary investigation, combining a systematic literature review, cognitive modeling, empirical user studies, participatory workshops, and meta-analysis, to develop the first taxonomy of trust in human-VLM interaction, integrating insights from cognitive science, collaborative agent theory, and human factors engineering. Contribution/Results: The study identifies six fundamental trust-related challenges and four key research directions; proposes a practical, implementation-oriented framework for trust assessment and enhancement; and delivers a comprehensive, theoretically grounded, empirically informed roadmap for designing, evaluating, and deploying trustworthy VLMs, bridging foundational theory with actionable design principles and evaluation methodologies.
📝 Abstract
The rapid adoption of Vision Language Models (VLMs), pre-trained on large image-text and video-text datasets, calls for protecting users and informing them about when to trust these systems. This survey reviews studies of trust dynamics in user-VLM interactions through a multidisciplinary taxonomy encompassing cognitive science capabilities, collaboration modes, and agent behaviours. Literature insights and findings from a workshop with prospective VLM users inform preliminary requirements for future studies of trust in VLMs.