🤖 AI Summary
This work addresses the problem of estimating liquid height, container geometry, pouring flow rate, and fill-time solely from pouring audio. We theoretically establish, for the first time, a unique mapping between the fundamental frequency (pitch) of pouring sounds and liquid level, cross-sectional area, and flow rate. Leveraging this physical insight, we design a physics-informed pitch detection objective function and integrate it with physics-based modeling, simulation-driven supervised training, vision-assisted self-supervised learning, and a deep pitch estimation network. We introduce the first large-scale real-world pouring video-audio dataset. Evaluated across diverse container shapes and unseen environments—including YouTube videos—our method achieves a mean absolute error (MAE) of <1.2 cm in liquid-level estimation, <8% relative error in container dimension estimation, and <0.5 s MAE in fill-time prediction. The approach demonstrates strong generalization capability and robustness to environmental variations.
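The pitch-to-level mapping described above can be illustrated with a standard acoustics approximation (not the paper's exact derivation): treating the air column above the liquid as a quarter-wave resonator, closed at the liquid surface and open at the top, gives a fundamental frequency f0 = c / (4L), where L is the air-column length and c the speed of sound. Under that assumption, liquid level, flow rate, and fill time follow directly; the function names and the constant cross-section assumption below are illustrative.

```python
# Hedged sketch of the quarter-wave resonator model for pouring audio.
# Assumptions: air column above the liquid acts as a closed-open tube
# (f0 = c / (4 * L)), the container has a constant cross-section, and
# c = 343 m/s (air at ~20 C). None of these names come from the paper.

C_SOUND = 343.0  # speed of sound in air, m/s

def air_column_length(f0):
    """Air-column length L (m) implied by fundamental frequency f0 (Hz)."""
    return C_SOUND / (4.0 * f0)

def liquid_level(f0, container_height):
    """Liquid height h = H - L (m) for a container of height H."""
    return container_height - air_column_length(f0)

def flow_rate(f0_t0, f0_t1, dt, container_height, cross_section_area):
    """Volumetric flow rate Q = A * dh/dt (m^3/s) from two pitch readings."""
    h0 = liquid_level(f0_t0, container_height)
    h1 = liquid_level(f0_t1, container_height)
    return cross_section_area * (h1 - h0) / dt

def time_to_fill(f0, cross_section_area, q):
    """Remaining time until full: remaining air volume / flow rate (s)."""
    return air_column_length(f0) * cross_section_area / q
```

As the liquid rises, L shrinks and the pitch sweeps upward, which is the audible cue the model exploits; for example, at f0 = 343 Hz the implied air column is exactly 0.25 m.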
📝 Abstract
We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate, and the time to fill. To this end, we: (i) show in theory that these properties can be determined from the fundamental frequency (pitch); (ii) train a pitch detection model, using supervision from simulated and visual data with a physics-inspired objective; (iii) introduce a new large dataset of real pouring videos to enable a systematic study; (iv) show that the trained model can indeed infer these physical properties from real data; and finally, (v) demonstrate strong generalization to various container shapes, other datasets, and in-the-wild YouTube videos. Our work offers a detailed understanding of a narrow yet rich problem at the intersection of acoustics, physics, and learning, and opens up applications for enhancing multisensory perception in robotic pouring.