🤖 AI Summary
Multimodal generative AI real-time agents must balance contextual awareness against low-latency responsiveness in domain-specific tasks, and existing methods make such experiences hard to prototype. Method: We propose a user-centered prototyping approach that integrates counterfactual video replay prompting with a hybrid Wizard-of-Oz (WoZ) methodology. It replays recorded multimodal inputs, such as speech and screen activity, to simulate real-time interaction, enabling fine-grained behavioral modeling and iterative refinement of agent behaviors in complex scenarios. Unlike conventional WoZ, the method incorporates counterfactual reasoning to improve situational controllability and response authenticity. Contribution/Results: We implement an interactive prototype system, distill a reusable end-to-end multimodal agent prototyping workflow, and release an open-source toolkit. Our prototyping experience, spanning both successes and limitations, indicates improved fidelity to real-world usage contexts. This work offers a user-centered, context-aware path to developing intelligent agents, supporting UX designers, HCI researchers, and AI developers.
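To make the replay idea concrete, here is a minimal, hypothetical sketch of counterfactual video replay prompting. It assumes the OpenAI Python client, OpenCV for frame extraction, and a multimodal chat model; the prompt wording, function names, timestamps, and file paths are illustrative assumptions, not the paper's released toolkit.

```python
# Hypothetical sketch of counterfactual video replay prompting.
# Assumes: openai>=1.0 Python client and opencv-python; all prompt text,
# names, and paths are illustrative, not the paper's actual toolkit.
import base64

import cv2  # pip install opencv-python
from openai import OpenAI

client = OpenAI()

def frames_at(video_path: str, timestamps_s: list[float]) -> list[str]:
    """Grab JPEG frames at the given timestamps, base64-encoded."""
    cap = cv2.VideoCapture(video_path)
    encoded = []
    for t in timestamps_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = cap.read()
        if not ok:
            continue
        ok, jpeg = cv2.imencode(".jpg", frame)
        if ok:
            encoded.append(base64.b64encode(jpeg.tobytes()).decode())
    cap.release()
    return encoded

def replay_prompt(frames_b64: list[str], transcript: str,
                  counterfactual: str) -> str:
    """Replay recorded context to the model under a 'what if' condition."""
    content = [{
        "type": "text",
        "text": (
            "You are a proactive assistant watching the user's screen.\n"
            f"Transcript so far:\n{transcript}\n\n"
            f"Counterfactual condition: {counterfactual}\n"
            "Given the screen frames below, what would you proactively say "
            "right now? Reply in one or two sentences, or say PASS."
        ),
    }]
    for b64 in frames_b64:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    resp = client.chat.completions.create(
        model="gpt-4o",  # any multimodal chat model
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

# Probe the same recorded moment under different counterfactual conditions.
frames = frames_at("session_recording.mp4", [42.0, 44.0, 46.0])
for what_if in ["the user had skipped the tutorial step",
                "the user had asked for help ten seconds earlier"]:
    reply = replay_prompt(frames, "User: hmm, where is the export button?",
                          what_if)
    print(what_if, "->", reply)
```

Replaying one recorded moment under several counterfactual conditions lets a designer probe how the agent would have behaved without re-running live sessions.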
📝 Abstract
Recent advancements in multimodal generative AI (GenAI) enable the creation of personal context-aware real-time agents that, for example, can augment user workflows by following their on-screen activities and providing contextual assistance. However, prototyping such experiences is challenging, especially when supporting people with domain-specific tasks using real-time inputs such as speech and screen recordings. While prototyping an LLM-based proactive support agent system, we found that existing prototyping and evaluation methods were insufficient to anticipate the nuanced situational complexity and contextual immediacy required. To overcome these challenges, we explored a novel user-centered prototyping approach that combines counterfactual video replay prompting and hybrid Wizard-of-Oz methods to iteratively design and refine agent behaviors. This paper discusses our prototyping experiences, highlighting successes and limitations, and offers a practical guide and an open-source toolkit for UX designers, HCI researchers, and AI toolmakers to build more user-centered and context-aware multimodal agents.
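As a companion sketch, the hybrid Wizard-of-Oz side might look like the loop below, where a human wizard reviews each AI-drafted response before it reaches the participant. All names here (agent_draft, wizard_review, deliver) are hypothetical placeholders; the paper's actual open-source toolkit may be structured quite differently.

```python
# Minimal hypothetical sketch of a hybrid Wizard-of-Oz review loop.
# agent_draft stands in for a real model call (see the replay sketch above).
def agent_draft(context: str) -> str:
    """Placeholder for an AI-generated proactive response."""
    return f"[draft] It looks like you're trying to {context}. Need a hand?"

def wizard_review(draft: str) -> str | None:
    """Wizard approves (Enter), rewrites (types text), or drops ('skip')."""
    print(f"\nAI draft: {draft}")
    edited = input("Enter=send, type a replacement, or 'skip': ").strip()
    if edited.lower() == "skip":
        return None
    return edited or draft

def deliver(msg: str) -> None:
    """Stand-in for whatever channel shows messages to the participant."""
    print(f"-> participant sees: {msg}")

for context in ["export a chart", "merge two audio tracks"]:
    reviewed = wizard_review(agent_draft(context))
    if reviewed is not None:
        deliver(reviewed)
```

The hybrid aspect is that the AI drafts candidate behaviors while the wizard retains veto and editing power, so designers can observe realistic agent output without exposing participants to unvetted responses.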