AI Summary
This work addresses the limitations of general-purpose physical AI models in real-world retail environments, where performance is hindered by insufficient joint understanding of spatial structure, physical dynamics, and embodied behavior. To bridge this gap, the authors introduce the first unified framework within a single real-world deployment domain that integrates three core knowledge dimensions: spatial layout, temporal physical dynamics, and embodied actions. They construct a large-scale, multi-view retail video dataset comprising 270,000 samples, captured from first-person, third-person, and 360-degree perspectives, and annotated with open-ended, chain-of-thought, and multiple-choice supervision signals. Fine-tuning an embodied vision-language model on this dataset yields substantial improvements: an average error reduction of 66.6% across more than 20 custom-designed capability probes and a 36.4% gain in embodied action understanding accuracy, significantly enhancing the model's holistic comprehension in realistic settings.
Abstract
A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics, and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions: Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP). To our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric, and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% relative to the pre-trained baseline, with significant gains in embodied action understanding, where accuracy improves by 36.4%. Our results suggest that ontology-structured, domain-specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at https://dreamvu.ai/prism
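For scale intuition, a minimal back-of-envelope reading of the quoted totals is given below. It treats the approximate figures (11.8M frames, 730M tokens, 4 fps) as exact and reads the 66.6% figure as a relative error reduction; neither convention is stated explicitly in the abstract, so both are assumptions:

\[
\frac{11.8 \times 10^{6}\ \text{frames}}{4\ \text{fps}} \approx 2.95 \times 10^{6}\ \text{s} \approx 820\ \text{hours of video},
\qquad
\frac{730 \times 10^{6}\ \text{tokens}}{11.8 \times 10^{6}\ \text{frames}} \approx 62\ \text{tokens/frame},
\]
\[
\text{relative error reduction} \;=\; \frac{e_{\text{pre}} - e_{\text{ft}}}{e_{\text{pre}}} \;\approx\; 0.666,
\]

where \(e_{\text{pre}}\) and \(e_{\text{ft}}\) denote the average probe error rates before and after fine-tuning. The tokens-per-frame ratio is only indicative, since the 730M count presumably also includes the text of the open-ended, chain-of-thought, and multiple-choice annotations.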