Acoustic Field Video for Multimodal Scene Understanding

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited perceptual capacity of existing vision–language models, which rely solely on RGB video and mono- or stereo-channel audio, constraining their reasoning in everyday scene understanding. To overcome this, the paper introduces "acoustic field video" as a new multimodal input modality: a low-cost beamforming microphone array generates real-time visualizations of spatial sound intensity across a scene. For the first time, spatial acoustic information is integrated into multimodal learning in an image-like format, enriching the model's perceptual dimensions without requiring complex hardware. Across an evaluation set of 402 question-answer scenes, incorporating acoustic field video raises the tested VLM's accuracy from 38.3% to 67.4%, underscoring the practical value of spatial acoustics for multimodal scene understanding.

📝 Abstract
We introduce and explore a new multimodal input representation for vision-language models: acoustic field video. Unlike conventional video (RGB with stereo/mono audio), our video stream provides a spatially grounded visualization of sound intensity across a scene, offering a new and powerful dimension of perceptual understanding. Our real-time pipeline uses low-cost beamforming microphone arrays that are already common in smart speakers and increasingly present in robotics and XR headsets, yet this sensing capability remains unutilized for scene understanding. To assess the value of spatial acoustic information, we constructed an evaluation set of 402 question-answer scenes, comparing a state-of-the-art VLM given conventional video with and without paired acoustic field video. Results show a clear and consistent improvement when incorporating spatial acoustic data; the VLM we test improves from 38.3% correct to 67.4%. Our findings highlight that many everyday scene understanding tasks remain underconstrained when relying solely on visual and audio input, and that acoustic field data provides a promising and practical direction for multimodal reasoning. A video demo is available at https://daehwakim.com/seeingsound
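The paper does not detail its acoustic-field rendering here, but the core idea of turning a microphone array's recordings into a spatial sound-intensity map can be illustrated with classic delay-and-sum beamforming. The sketch below is an assumption-laden 1-D analogue (power versus azimuth rather than a full image frame): the array geometry, function name, and parameters are all illustrative, not the authors' pipeline.

```python
import numpy as np

def delay_and_sum_map(signals, mic_xy, fs, az_grid, c=343.0):
    """Steer a planar mic array over candidate azimuths and return the
    beamformed power per direction -- a 1-D slice of an acoustic field.

    signals: (n_mics, n_samples) time-domain recordings
    mic_xy:  (n_mics, 2) microphone positions in metres
    fs:      sample rate in Hz
    az_grid: candidate azimuths in radians
    c:       speed of sound in m/s
    """
    n_mics, n_samples = signals.shape
    spectra = np.fft.rfft(signals, axis=1)         # per-mic spectra
    f = np.fft.rfftfreq(n_samples, d=1.0 / fs)     # FFT bin frequencies
    power = np.zeros(len(az_grid))
    for i, az in enumerate(az_grid):
        # far-field unit vector pointing toward the candidate direction
        d = np.array([np.cos(az), np.sin(az)])
        delays = mic_xy @ d / c                    # per-mic arrival advances
        # undo each mic's phase advance, then sum the spectra coherently;
        # the sum is largest when the steering matches the true direction
        steer = np.exp(-2j * np.pi * f[None, :] * delays[:, None])
        beam = (spectra * steer).sum(axis=0)
        power[i] = np.sum(np.abs(beam) ** 2)
    return power
```

A 2-D acoustic field frame, as described in the abstract, would extend this to a grid of azimuth/elevation (or pixel) directions and colour-map the resulting power, overlaying it on the RGB video in real time.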
Problem

Research questions and friction points this paper is trying to address.

multimodal scene understanding
acoustic field video
vision-language models
spatial acoustic information
scene understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

acoustic field video
spatial audio
multimodal scene understanding
beamforming microphone array
vision-language models