ProCap: Projection-Aware Captioning for Spatial Augmented Reality

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses a core challenge in spatial augmented reality (SAR): existing vision-language models struggle to distinguish physical scenes from projected content, which leads to semantic ambiguity. To resolve this, the authors propose ProCap, a framework that explicitly decouples virtual and physical semantics through a two-stage process: it first automatically segments the two layers, then employs region-aware retrieval to disambiguate distortions caused by projection. The work presents the first approach to semantically disentangle virtual and physical content in SAR, introduces RGBP, the first large-scale SAR semantic benchmark dataset comprising 65 scenes and over 180,000 projected samples, and proposes a dual-caption evaluation protocol. Experimental results demonstrate ProCap's effectiveness, establishing a robust semantic foundation for intelligent SAR interaction.
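
As a concrete illustration of the two-stage decoupling, the sketch below is a minimal, hypothetical rendering of the idea rather than the authors' method: layer segmentation is approximated by frame differencing against a projector-off capture (an assumed input), and region-aware retrieval by a toy color-histogram embedding over a small content gallery. `split_layers`, `embed`, and `retrieve` are illustrative names, not APIs from the paper.

```python
# A minimal, hypothetical sketch of ProCap's two-stage idea -- NOT the
# authors' implementation. Stage 1 approximates virtual/physical layer
# segmentation by differencing a projector-on frame against a projector-off
# capture (assumed to be available). Stage 2 approximates region-aware
# retrieval with a toy color-histogram embedding over a small content gallery.
import numpy as np

def split_layers(frame_on: np.ndarray, frame_off: np.ndarray, thresh: float = 30.0):
    """Split an HxWx3 uint8 capture into physical and projected layers.

    frame_on  : capture with the projector running
    frame_off : capture of the bare physical scene
    """
    diff = np.abs(frame_on.astype(np.int16) - frame_off.astype(np.int16)).sum(axis=-1)
    mask = diff > thresh                          # True where projection altered pixels
    projected = np.where(mask[..., None], frame_on, 0).astype(np.uint8)
    physical = np.where(mask[..., None], 0, frame_on).astype(np.uint8)
    return physical, projected, mask

def embed(region: np.ndarray) -> np.ndarray:
    """Toy region embedding: a normalized 4x4x4 color histogram
    (a stand-in for a learned image encoder)."""
    hist, _ = np.histogramdd(region.reshape(-1, 3), bins=(4, 4, 4),
                             range=((0, 256),) * 3)
    v = hist.ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def retrieve(query_region: np.ndarray, gallery: dict) -> str:
    """Return the gallery item most similar to the projected region, standing
    in for region-aware retrieval over undistorted source content."""
    q = embed(query_region)
    return max(gallery, key=lambda name: float(embed(gallery[name]) @ q))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scene = rng.integers(0, 255, (240, 320, 3), dtype=np.uint8)
    slide = np.zeros((120, 160, 3), dtype=np.uint8)
    slide[..., 2] = 200                           # a bluish synthetic "slide"
    lit = scene.copy()
    lit[60:180, 100:260] = slide                  # paste the synthetic projection
    physical, projected, mask = split_layers(lit, scene)
    gallery = {"blue slide": slide,
               "red slide": np.full((120, 160, 3), (200, 0, 0), np.uint8)}
    print("projected pixels:", int(mask.sum()))
    print("retrieved:", retrieve(lit[60:180, 100:260], gallery))
```

The paper's actual segmentation is learned and automated; the differencing here only conveys why separating the layers first makes the downstream retrieval well-posed.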
📝 Abstract
Spatial augmented reality (SAR) directly projects digital content onto physical scenes using projectors, creating immersive experiences without head-mounted displays. However, for SAR to support intelligent interaction, such as reasoning about the scene or answering user queries, it must semantically distinguish between the physical scene and the projected content. Standard Vision Language Models (VLMs) struggle with this virtual-physical ambiguity, often confusing the two contexts. To address this issue, we introduce ProCap, a novel framework that explicitly decouples projected content from physical scenes. ProCap employs a two-stage pipeline: it first visually isolates the virtual and physical layers via automated segmentation, then uses region-aware retrieval to resolve the ambiguous semantic context introduced by projection distortion. To support this, we present RGBP (RGB + Projections), the first large-scale SAR semantic benchmark dataset, featuring 65 diverse physical scenes and over 180,000 projections with dense, decoupled annotations. Finally, we establish a dual-captioning evaluation protocol using task-specific tokens to assess physical scene and projection descriptions independently. Our experiments show that ProCap provides a robust semantic foundation for future SAR research. The source code, pre-trained models, and the RGBP dataset are available on the project page: https://ZimoCao.github.io/ProCap/.
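
To make the dual-captioning protocol concrete, the following hedged sketch shows one way task-specific tokens could route a captioner between physical and projected descriptions, with each output scored against its own reference. The `<PHYS>`/`<PROJ>` token names, the stub `caption` function, and the unigram-F1 metric are assumptions for illustration only, not the paper's specification.

```python
# Hedged sketch of a dual-caption evaluation in the spirit of the protocol
# above: a task token selects whether the model describes the physical scene
# or the projection, and each caption is scored against its own reference.
# The <PHYS>/<PROJ> token names, the stub captioner, and the unigram-F1
# metric are illustrative assumptions, not the paper's specification.
from collections import Counter

def unigram_f1(pred: str, ref: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference caption."""
    p, r = Counter(pred.lower().split()), Counter(ref.lower().split())
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(p.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def caption(image_id: str, task_token: str) -> str:
    """Stand-in for a projection-aware captioner conditioned on a task token."""
    stub = {
        ("scene_01", "<PHYS>"): "a wooden table beside a white wall",
        ("scene_01", "<PROJ>"): "a projected map of a city at night",
    }
    return stub[(image_id, task_token)]

references = {
    "scene_01": {"<PHYS>": "a wooden table near a white wall",
                 "<PROJ>": "a projected night map of a city"},
}

for image_id, refs in references.items():
    for token, ref in refs.items():
        score = unigram_f1(caption(image_id, token), ref)
        print(f"{image_id} {token}: F1 = {score:.2f}")
```

The key property the protocol tests is independence: a model that mixes the two contexts will score well on at most one of the two references per scene.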
Problem

Research questions and friction points this paper is trying to address.

Spatial Augmented Reality
Vision Language Models
virtual-physical ambiguity
semantic understanding
projection-aware captioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Projection-Aware Captioning
Spatial Augmented Reality
Vision Language Models
Semantic Decoupling
RGBP Dataset