"Less is More": Reducing Cognitive Load and Task Drift in Real-Time Multimodal Assistive Agents for the Visually Impaired

📅 2025-11-02
🤖 AI Summary
To address high cognitive load and severe task drift in vision-language assistive systems for visually impaired users, this paper proposes VIA-Agent, a real-time multimodal embodied agent system. Methodologically, it introduces a goal-persistent design with calibrated conciseness to reduce cognitive load, and establishes an embodied interaction architecture built on real-time communication (RTC), evolving from a request-response Model Context Protocol (MCP) pipeline and adding dynamic dialogue control to enable persistent goal reasoning and low-latency responses. Evaluation in real-world scenarios shows that VIA-Agent achieves task success rates comparable to Doubao and significantly outperforms BeMyAI. Relative to Doubao, it reduces mean task completion time by 39.9%, requires fewer dialogue turns (4.3 vs. 5.0), and lowers perceived cognitive load and task drift. The system attains the highest System Usability Scale (SUS) score among the evaluated systems.

📝 Abstract
Vision-Language Models (VLMs) enable on-demand visual assistance, yet current applications for people with visual impairments (PVI) impose high cognitive load and exhibit task drift, limiting real-world utility. We first conducted a formative study with 15 PVI and identified three requirements for visually impaired assistance (VIA): low latency for real-time use, minimal cognitive load, and hallucination-resistant responses to sustain trust. Informed by the formative study, we present VIA-Agent, a prototype that co-optimizes its cognitive 'brain' and interactive 'body'. The brain implements a goal-persistent design with calibrated conciseness to produce brief, actionable guidance; the body adopts a real-time communication (RTC) embodiment, evolving from a request-response Model Context Protocol (MCP) pipeline, to support fluid interaction. We evaluated VIA-Agent with 9 PVI across navigation and object retrieval in the wild against BeMyAI and Doubao. VIA-Agent significantly outperformed BeMyAI both quantitatively and qualitatively. While achieving success rates comparable to Doubao, it reduced mean task time by 39.9% (70.1 s vs. 110.7 s), required fewer conversational turns (4.3 vs. 5.0), and lowered perceived cognitive load and task drift. System Usability Scale (SUS) results aligned with these findings, with VIA-Agent achieving the highest usability. We hope this work inspires the development of more human-centered VIA systems.
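
The abstract reports SUS results without describing how they were scored. For readers unfamiliar with the scale, the sketch below shows the standard SUS scoring rule: ten items rated 1 to 5, odd-numbered items contribute (rating - 1), even-numbered items contribute (5 - rating), and the sum is multiplied by 2.5 to give a 0-100 score. The function name `sus_score` and the sample responses are illustrative assumptions, not data or code from the study.

```python
def sus_score(responses):
    """Return the standard SUS score (0-100) for one participant.

    `responses` holds the ten item ratings in questionnaire order,
    each an integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each rating must be an integer from 1 to 5")

    total = 0
    for index, rating in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded.
            total += rating - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded.
            total += 5 - rating
    return total * 2.5


# Hypothetical participant who answers "neutral" to every item.
print(sus_score([3] * 10))  # 50.0
```

A system-level SUS value is typically the mean of these per-participant scores.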
Problem

Research questions and friction points this paper is trying to address.

Reducing cognitive load in real-time visual assistance
Minimizing task drift for visually impaired users
Providing hallucination-resistant responses to sustain trust
Innovation

Methods, ideas, or system contributions that make the work stand out.

Goal-persistent design with calibrated conciseness
Real-time communication embodiment replacing request-response
Co-optimized cognitive brain and interactive body
Yi Zhao
Department of Computing, The Hong Kong Polytechnic University
Siqi Wang
Department of Computing, The Hong Kong Polytechnic University
Qiqun Geng
Department of Computing, The Hong Kong Polytechnic University
Erxin Yu
The Hong Kong Polytechnic University
Jing Li
Department of Computing, The Hong Kong Polytechnic University