🤖 AI Summary
Multimodal generative AI real-time agents must balance contextual awareness against low-latency responsiveness in domain-specific tasks, and existing methods make such experiences hard to prototype. Method: We propose a user-centered prototyping approach that integrates counterfactual video replay prompting with a hybrid Wizard-of-Oz (WoZ) methodology. It replays recorded multimodal inputs, such as speech and screen activity, to simulate real-time interaction, enabling fine-grained behavioral modeling and iterative refinement of agent behaviors in complex scenarios. Unlike conventional WoZ, the method incorporates counterfactual reasoning to improve situational controllability and response authenticity. Contribution/Results: We implement an interactive prototype system, distill a reusable end-to-end multimodal agent prototyping workflow, and release an open-source toolkit. Our prototyping experience, spanning both successes and limitations, indicates improved fidelity to real-world usage contexts. This work offers a user-centered, context-aware path to developing intelligent agents, supporting UX designers, HCI researchers, and AI developers.
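To make the replay idea concrete, here is a minimal, hypothetical sketch of counterfactual video replay prompting. It assumes the OpenAI Python client, OpenCV for frame extraction, and a multimodal chat model; the prompt wording, function names, timestamps, and file paths are illustrative assumptions, not the paper's released toolkit.

```python
# Hypothetical sketch of counterfactual video replay prompting.
# Assumes: openai>=1.0 Python client and opencv-python; all prompt text,
# names, and paths are illustrative, not the paper's actual toolkit.
import base64

import cv2  # pip install opencv-python
from openai import OpenAI

client = OpenAI()

def frames_at(video_path: str, timestamps_s: list[float]) -> list[str]:
    """Grab JPEG frames at the given timestamps, base64-encoded."""
    cap = cv2.VideoCapture(video_path)
    encoded = []
    for t in timestamps_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = cap.read()
        if not ok:
            continue
        ok, jpeg = cv2.imencode(".jpg", frame)
        if ok:
            encoded.append(base64.b64encode(jpeg.tobytes()).decode())
    cap.release()
    return encoded

def replay_prompt(frames_b64: list[str], transcript: str,
                  counterfactual: str) -> str:
    """Replay recorded context to the model under a 'what if' condition."""
    content = [{
        "type": "text",
        "text": (
            "You are a proactive assistant watching the user's screen.\n"
            f"Transcript so far:\n{transcript}\n\n"
            f"Counterfactual condition: {counterfactual}\n"
            "Given the screen frames below, what would you proactively say "
            "right now? Reply in one or two sentences, or say PASS."
        ),
    }]
    for b64 in frames_b64:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    resp = client.chat.completions.create(
        model="gpt-4o",  # any multimodal chat model
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

# Probe the same recorded moment under different counterfactual conditions.
frames = frames_at("session_recording.mp4", [42.0, 44.0, 46.0])
for what_if in ["the user had skipped the tutorial step",
                "the user had asked for help ten seconds earlier"]:
    reply = replay_prompt(frames, "User: hmm, where is the export button?",
                          what_if)
    print(what_if, "->", reply)
```

Replaying one recorded moment under several counterfactual conditions lets a designer probe how the agent would have behaved without re-running live sessions.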
📝 Abstract
Recent advancements in multimodal generative AI (GenAI) enable the creation of personal context-aware real-time agents that, for example, can augment user workflows by following their on-screen activities and providing contextual assistance. However, prototyping such experiences is challenging, especially when supporting people with domain-specific tasks using real-time inputs such as speech and screen recordings. While prototyping an LLM-based proactive support agent system, we found that existing prototyping and evaluation methods were insufficient to anticipate the nuanced situational complexity and contextual immediacy required. To overcome these challenges, we explored a novel user-centered prototyping approach that combines counterfactual video replay prompting and hybrid Wizard-of-Oz methods to iteratively design and refine agent behaviors. This paper discusses our prototyping experiences, highlighting successes and limitations, and offers a practical guide and an open-source toolkit for UX designers, HCI researchers, and AI toolmakers to build more user-centered and context-aware multimodal agents.
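As a companion sketch, the hybrid Wizard-of-Oz side might look like the loop below, where a human wizard reviews each AI-drafted response before it reaches the participant. All names here (agent_draft, wizard_review, deliver) are hypothetical placeholders; the paper's actual open-source toolkit may be structured quite differently.

```python
# Minimal hypothetical sketch of a hybrid Wizard-of-Oz review loop.
# agent_draft stands in for a real model call (see the replay sketch above).
def agent_draft(context: str) -> str:
    """Placeholder for an AI-generated proactive response."""
    return f"[draft] It looks like you're trying to {context}. Need a hand?"

def wizard_review(draft: str) -> str | None:
    """Wizard approves (Enter), rewrites (types text), or drops ('skip')."""
    print(f"\nAI draft: {draft}")
    edited = input("Enter=send, type a replacement, or 'skip': ").strip()
    if edited.lower() == "skip":
        return None
    return edited or draft

def deliver(msg: str) -> None:
    """Stand-in for whatever channel shows messages to the participant."""
    print(f"-> participant sees: {msg}")

for context in ["export a chart", "merge two audio tracks"]:
    reviewed = wizard_review(agent_draft(context))
    if reviewed is not None:
        deliver(reviewed)
```

The hybrid aspect is that the AI drafts candidate behaviors while the wizard retains veto and editing power, so designers can observe realistic agent output without exposing participants to unvetted responses.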