🤖 AI Summary
Existing CAVE systems predominantly rely on pre-authored content, lacking dynamic narrative responsiveness and flexible physical–digital integration. This paper introduces the first room-scale generative interactive narrative framework that transforms physical environments in real time into immersive, story-responsive spaces. Leveraging real-time camera-based tracking, cylindrical projection mapping, and generative AI orchestration, the system enables object-level physical–digital replacement, speech-driven plot progression, and multimodal narration—orchestrated by an AI narrator generating synchronized audio, dialogue, and scene visuals. The architecture integrates computer vision, automatic speech recognition/synthesis, environmental sound generation, and multimodal interaction. A user study (n=13) demonstrates statistically significant improvements in immersion and engagement, with the AI narrator and generated audio identified as the most impactful factors. Key technical limitations identified include end-to-end latency and projected image resolution—highlighting critical avenues for future optimization.
📝 Abstract
While Cave Automatic Virtual Environment (CAVE) systems have long enabled room-scale virtual reality and various kinds of interactivity, their content has largely remained predetermined. We present \textit{Storycaster}, a generative AI CAVE system that transforms physical rooms into responsive storytelling environments. Unlike headset-based VR, \textit{Storycaster} preserves spatial awareness, using live camera feeds to augment the walls with cylindrical projections, allowing users to create worlds that blend with their physical surroundings. Additionally, our system enables object-level editing, in which physical items in the room are transformed into their virtual counterparts in a story. A narrator agent guides participants, enabling them to co-create stories that evolve in response to voice commands, with each scene enhanced by generated ambient audio, dialogue, and imagery. Participants in our study ($n=13$) found the system highly immersive and engaging, rating the narrator and generated audio as the most impactful elements, while highlighting latency and image resolution as areas for improvement.