🤖 AI Summary
Existing AR narrative systems rely on static object labels and coordinates, neglecting semantic relationships among objects (e.g., a wedding ring on a nightstand symbolizing marital conflict). The result is rigid storytelling, shallow use of spatial semantics, and misalignment with AR Foundation's object naming and coordinate systems. This paper proposes an object-driven narrative framework for AR that treats the physical environment as an active narrative agent. Leveraging Vision-Language Models (VLMs), it performs state-aware, three-layer semantic parsing (physical, functional, and metaphorical) and establishes a bidirectional JSON narrative interface that maps VLM-generated metaphors to AR anchors. A novel STAM evaluation framework (Spatial, Temporal, Analogical-Metaphorical) assesses narrative quality, including spatial-semantic anchoring and metaphorical reasoning. The framework eliminates reliance on predefined scripts, enabling dynamic mapping from environmental symbols to narrative anchors. In a user study, 70% of participants reported that symbolically grounded narrative delivery significantly altered their perception of real-world objects.
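To make the three-layer decomposition concrete, here is a minimal sketch of how one object's semantics might be recorded; the `ObjectSemantics` class and all field values are our hypothetical illustration, not the paper's actual data model:

```python
from dataclasses import dataclass

@dataclass
class ObjectSemantics:
    """Hypothetical record holding the three semantic layers for one detected object."""
    label: str         # raw detector label, shared with the AR layer
    physical: str      # observable state: what the object is and where it sits
    functional: str    # conventional use or purpose of the object
    metaphorical: str  # symbolic reading the narrative can build on

# The wedding-ring example from the abstract, encoded by hand:
ring = ObjectSemantics(
    label="wedding_ring",
    physical="gold band resting on a nightstand, not being worn",
    functional="token worn to signify an ongoing marriage",
    metaphorical="a removed ring hints at marital distance or conflict",
)
```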
📝 Abstract
Most adaptive AR storytelling systems define environmental semantics using simple object labels and spatial coordinates, limiting narratives to rigid, predefined logic. This oversimplification overlooks the contextual significance of object relationships: for example, a wedding ring on a nightstand might suggest marital conflict, yet it is treated as just "two objects" in space. To address this, we explored integrating Vision-Language Models (VLMs) into AR pipelines. Several challenges emerged: first, stories generated with simple prompt guidance lacked narrative depth and made little use of the space; second, spatial semantics were underutilized and failed to support meaningful storytelling; third, pre-generated scripts struggled to align with AR Foundation's object naming and coordinate systems. We propose a scene-driven AR storytelling framework that reimagines environments as active narrative agents, built on three innovations:

1. State-aware object semantics: We decompose object meaning into physical, functional, and metaphorical layers, allowing VLMs to distinguish subtle narrative cues between similar objects.
2. Structured narrative interface: A bidirectional JSON layer maps VLM-generated metaphors to AR anchors, maintaining spatial and semantic coherence (a sketch follows the abstract).
3. STAM evaluation framework: A three-part experimental design evaluates narrative quality, highlighting both the strengths and limitations of VLM-AR integration.

Our findings show that the system can generate stories from the environment itself, not merely place them on top of it. In user studies, 70% of participants reported seeing real-world objects differently when narratives were grounded in environmental symbolism. By merging VLMs' generative creativity with AR's spatial precision, this framework introduces a novel object-driven storytelling paradigm, transforming passive spaces into active narrative landscapes.
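As an illustration of the bidirectional JSON layer, here is a minimal sketch of one narrative-anchor record, assuming the VLM writes the symbolic fields while the AR runtime writes the spatial ones; every key name here (`anchor_id`, `story_beat`, etc.) is hypothetical rather than the paper's actual schema:

```python
import json

# Hypothetical bidirectional record: the VLM supplies the narrative fields and
# the AR session (e.g. AR Foundation) supplies the spatial fields, so either
# side can resolve the other's references through the shared object label.
narrative_anchor = {
    "anchor_id": "anchor_042",           # illustrative AR trackable identifier
    "object_label": "wedding_ring",      # key shared by the VLM and AR layers
    "position": [0.41, 0.93, -1.20],     # world-space coordinates from the AR session
    "metaphor": "a marriage set aside",  # VLM-generated symbolic reading
    "story_beat": "She left the ring where he would be sure to find it.",
}

# Serialize for the VLM <-> AR exchange and print the payload.
print(json.dumps(narrative_anchor, indent=2))
```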