🤖 AI Summary
This work addresses a limitation of current intelligent vehicle systems, which rely predominantly on visual information and struggle to accurately perceive driver states and external interaction intent in complex or visually constrained environments. To overcome this, the paper proposes L-LIO, a framework that systematically integrates the audio modality into a unified inside-and-outside vehicle perception system. By fusing visual and auditory signals, L-LIO enables multimodal understanding of driver anomalies (e.g., intoxication), natural language commands from passengers, and the gestures and speech of external agents. Pilot results on newly collected in-vehicle and external audio data from real-world environments indicate that the approach improves environmental perception and safety-aware decision-making in challenging traffic scenarios, moving beyond the constraints of vision-only systems and opening new dimensions for multimodal human–vehicle interaction.
📝 Abstract
The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction during autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver and, in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework, which enhances driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication); collection and analysis of passenger natural language instructions (e.g., "turn after that red building"), which motivate how spoken language can interface with planning systems through audio-aligned instruction data; and analysis of scenarios that expose the limits of vision-only systems, where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples recorded in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Remaining challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.
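
To make the first example case concrete, the sketch below shows a minimal supervised pipeline for classifying short driver-speech clips from log-mel audio features. It is an illustrative assumption rather than the paper's reported method: the `librosa`/`scikit-learn` tooling, the 16 kHz sample rate, the time-averaged feature summary, and the synthetic clips standing in for the custom-collected in-vehicle recordings are all placeholders.

```python
# Minimal sketch: supervised classification of short driver-speech clips
# (e.g., "baseline" vs. "impaired") from log-mel audio features.
# All data here is synthetic and stands in for real in-vehicle recordings;
# the feature choices and classifier are illustrative assumptions, not the
# paper's reported pipeline.

import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

SR = 16_000          # assumed in-cabin microphone sample rate
CLIP_SECONDS = 3     # assumed clip length per utterance


def clip_features(waveform: np.ndarray, sr: int = SR) -> np.ndarray:
    """Average a log-mel spectrogram over time -> fixed-length feature vector."""
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=64)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return log_mel.mean(axis=1)  # shape (64,): one summary vector per clip


def synthetic_clip(label: int, rng: np.random.Generator) -> np.ndarray:
    """Placeholder audio: a noisy tone whose pitch depends on the label."""
    t = np.arange(SR * CLIP_SECONDS) / SR
    tone = np.sin(2 * np.pi * (120 + 80 * label) * t)   # label-dependent pitch
    noise = 0.3 * rng.standard_normal(t.shape)
    return (tone + noise).astype(np.float32)


rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)      # 0 = baseline, 1 = impaired (synthetic labels)
X = np.stack([clip_features(synthetic_clip(y, rng)) for y in labels])

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

In practice the synthetic clips would be replaced by labeled in-vehicle recordings, and the simple time-averaged features and linear classifier could be swapped for any stronger audio model; the sketch only illustrates the supervised classification setting the abstract describes.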