🤖 AI Summary
Existing olfactory media systems struggle to generate interpretable, dynamically synchronized scent cues for video automatically, relying instead on manually designed mappings with limited generalizability. This work proposes a two-stage video-to-olfaction planning framework: first, a vision-language model such as CLIP extracts semantic content from the video; then, a large language model maps that semantic representation into human-interpretable scent plans aligned with on-screen actions. The approach demonstrates, for the first time, that semantically grounded odor planning is comprehensible to viewers even before any physical scent is released. Survey-based user studies show a consistent preference for the generated plans over over-inclusive and naive baselines, in both perceptual salience and temporal alignment with visible actions, supporting the feasibility of semantics-driven olfactory media.
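As a rough illustration of the two-stage split described above, the sketch below scores sampled video frames against a small odor-source vocabulary with CLIP and then hands the resulting timeline to a language model for plan generation. The model names, the `ODOR_SOURCES` vocabulary, and the `query_llm` stub are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a two-stage video-to-scent planning pipeline.
# Stage 1: CLIP scores each sampled frame against a candidate odor-source vocabulary.
# Stage 2: an LLM turns the per-frame semantics into a timed, human-readable scent plan.
# Model choices, the odor vocabulary, and query_llm are assumptions for illustration only.

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

ODOR_SOURCES = ["coffee", "fresh bread", "ocean spray", "cut grass", "smoke", "citrus"]

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def frame_semantics(frame: Image.Image, top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top-k odor-source labels for one video frame with CLIP probabilities."""
    inputs = processor(text=ODOR_SOURCES, images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    scores, idx = probs.topk(top_k)
    return [(ODOR_SOURCES[i], float(s)) for s, i in zip(scores, idx)]


def query_llm(prompt: str) -> str:
    """Placeholder for whichever chat-completion API is available."""
    raise NotImplementedError("Plug in an LLM client here.")


def build_scent_plan(frames: list[tuple[float, Image.Image]]) -> str:
    """Summarize per-frame semantics, then ask an LLM for a timed scent plan."""
    timeline = "\n".join(
        f"t={t:.1f}s: " + ", ".join(f"{label} ({score:.2f})" for label, score in frame_semantics(img))
        for t, img in frames
    )
    prompt = (
        "You plan scent release for a video. Given the detected odor sources per "
        "timestamp, output a short plan stating which scent to release, when, and why, "
        "keeping scent changes aligned with visible actions.\n\n" + timeline
    )
    return query_llm(prompt)
```

Keeping the two stages separate in this way means the perceptual vocabulary and the planning prompt can be tuned independently, mirroring the decoupling of visual semantic extraction from semantic-to-olfactory inference described in the abstract.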
📝 Abstract
Olfactory cues can enhance immersion in interactive media, yet smell remains rare because it is difficult to author and synchronize with dynamic video. Prior olfactory interfaces rely on designer-defined triggers and fixed event-to-odor mappings that do not scale to unconstrained content. This work examines whether semantic planning for smell is intelligible to people before physical scent delivery. We present a video-to-scent planning pipeline that separates visual semantic extraction, performed by a vision-language model, from semantic-to-olfactory inference, performed by a large language model. Two survey studies compare system-generated scent plans against over-inclusive and naive baselines. Results show a consistent preference for plans that prioritize perceptually salient cues and align scent changes with visible actions, supporting semantic planning as a foundation for future olfactory media systems.