Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation

📅 2025-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the “sensory gap” between audio large language models (LLMs) and human auditory, visual, and audiovisual perception in sound-object recognition. Observing imbalanced performance across existing audio, visual, and audiovisual multimodal models (Qwen2-Audio, Qwen2-VL, and Qwen2.5-Omni) on this task, the authors propose the first human-perception-aligned evaluation framework together with a difficulty-aware bidirectional cross-modal knowledge distillation method that transfers complementary capabilities between the audio and visual modalities. The approach explicitly bridges modality-specific biases by aligning model behavior with human perceptual priors. Experiments demonstrate substantial improvements on challenging samples, such as fine-grained or low signal-to-noise-ratio sounds, increasing accuracy by up to 12.3% over the baseline audio model and narrowing the performance gap between the audio and visual models by 12.3 percentage points. The authors present this as the first empirical validation that sensory-aligned modeling effectively mitigates modality bias in multimodal systems.
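To make the human-aligned comparison concrete, the snippet below is a minimal, illustrative sketch (not the authors' code) of how a per-class accuracy gap between an audio LLM and a visual LLM could be compared against the human ear-vs-eye gap on the same sound classes; the array names, the synthetic data, and the use of Pearson correlation are assumptions for illustration only.

```python
# A minimal, illustrative sketch (not the authors' code): compare the per-class
# accuracy gap between an audio LLM and a visual LLM with the human ear-vs-eye
# gap on the same sound classes. Synthetic arrays stand in for real evaluations.
import numpy as np

def modality_gap(acc_audio: np.ndarray, acc_visual: np.ndarray) -> np.ndarray:
    """Per-class accuracy gap (visual minus audio); positive = easier by eye."""
    return acc_visual - acc_audio

def gap_alignment(model_gap: np.ndarray, human_gap: np.ndarray) -> float:
    """Pearson correlation between the model gap and the human gap per class."""
    return float(np.corrcoef(model_gap, human_gap)[0, 1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acc_audio  = rng.uniform(0.3, 0.9, 20)   # e.g., Qwen2-Audio per-class accuracy
    acc_visual = rng.uniform(0.3, 0.9, 20)   # e.g., Qwen2-VL per-class accuracy
    human_gap  = rng.uniform(-0.3, 0.5, 20)  # human eye-minus-ear accuracy gap
    print(round(gap_alignment(modality_gap(acc_audio, acc_visual), human_gap), 3))
```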

📝 Abstract
Audio large language models (LLMs) are considered experts at recognizing sound objects, yet their performance relative to LLMs in other sensory modalities, such as visual or audio-visual LLMs, and to humans using their ears, eyes, or both remains unexplored. To investigate this, we systematically evaluate audio, visual, and audio-visual LLMs, specifically Qwen2-Audio, Qwen2-VL, and Qwen2.5-Omni, against humans in recognizing sound objects of different classes from audio-only, silent video, or sounded video inputs. We uncover a performance gap between Qwen2-Audio and Qwen2-VL that parallels the sensory discrepancy between human ears and eyes. To reduce this gap, we introduce a cross-modal distillation framework, where an LLM in one modality serves as the teacher and another as the student, with knowledge transfer in sound classes predicted as more challenging to the student by a heuristic model. Distillation in both directions, from Qwen2-VL to Qwen2-Audio and vice versa, leads to notable improvements, particularly in challenging classes. This work highlights the sensory gap in LLMs from a human-aligned perspective and proposes a principled approach to enhancing modality-specific perception in multimodal LLMs.
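As a rough illustration of the distillation framework described in the abstract, the sketch below shows one way a difficulty-weighted cross-modal distillation loss could look in PyTorch. The teacher-student direction (e.g., Qwen2-VL teaching Qwen2-Audio), the temperature, and the per-class weighting scheme are assumptions rather than the paper's exact recipe.

```python
# A minimal sketch of difficulty-aware cross-modal distillation: a teacher LLM
# in one modality supervises a student in the other, with the distillation loss
# up-weighted on sound classes estimated to be hard for the student modality.
# Tensor names and the weighting scheme are illustrative assumptions.
import torch
import torch.nn.functional as F

def difficulty_weighted_kd_loss(student_logits: torch.Tensor,
                                teacher_logits: torch.Tensor,
                                labels: torch.Tensor,
                                class_difficulty: torch.Tensor,
                                temperature: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) over sound classes, weighted per sample by how
    difficult the ground-truth class is for the student modality.

    student_logits, teacher_logits: (batch, num_classes)
    labels:           (batch,) ground-truth sound-class indices
    class_difficulty: (num_classes,) weights in [0, 1] from a heuristic model
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Per-sample KL divergence (summed over classes), standard KD scaling by T^2.
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
    kl = kl * (temperature ** 2)
    # Emphasize samples whose class the heuristic marks as hard for the student.
    weights = class_difficulty[labels]
    return (weights * kl).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    batch, num_classes = 4, 10
    s = torch.randn(batch, num_classes)          # student (e.g., audio LLM) logits
    t = torch.randn(batch, num_classes)          # teacher (e.g., visual LLM) logits
    y = torch.randint(0, num_classes, (batch,))  # ground-truth sound classes
    difficulty = torch.rand(num_classes)         # heuristic per-class difficulty
    print(difficulty_weighted_kd_loss(s, t, y, difficulty))
```

Weighting by the difficulty of the ground-truth class is one simple way to focus the transfer on the classes the student handles poorly; swapping the teacher and student tensors gives the reverse direction of the bidirectional setup.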
Problem

Research questions and friction points this paper is trying to address.

Comparing audio and visual LLMs with humans in sound recognition
Identifying performance gaps between audio and visual LLMs
Reducing sensory gaps via cross-modal distillation framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal distillation between audio and visual LLMs
Heuristic model identifies sound classes that are challenging for the student (see the sketch after this list)
Enhances modality-specific perception in multimodal LLMs
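The bullet on the heuristic model can be made concrete with the small sketch below, which estimates per-class difficulty for the student from held-out accuracy and selects the classes targeted for knowledge transfer. The accuracy-threshold rule and the function names are assumptions, not the paper's exact heuristic.

```python
# A minimal sketch of one way to flag "challenging" sound classes for the
# student modality: measure held-out per-class accuracy and keep the classes
# below a threshold. The thresholding rule is an illustrative assumption.
from collections import defaultdict
from typing import Dict, List, Sequence

def per_class_accuracy(preds: Sequence[int], labels: Sequence[int]) -> Dict[int, float]:
    """Accuracy of the student model for each sound class on a held-out set."""
    correct, total = defaultdict(int), defaultdict(int)
    for p, y in zip(preds, labels):
        total[y] += 1
        correct[y] += int(p == y)
    return {c: correct[c] / total[c] for c in total}

def challenging_classes(preds: Sequence[int], labels: Sequence[int],
                        threshold: float = 0.5) -> List[int]:
    """Classes whose held-out accuracy falls below `threshold` are treated as
    challenging for the student and prioritized for cross-modal transfer."""
    acc = per_class_accuracy(preds, labels)
    return sorted(c for c, a in acc.items() if a < threshold)

if __name__ == "__main__":
    preds  = [0, 1, 1, 2, 2, 0, 1, 2]
    labels = [0, 1, 2, 2, 1, 0, 1, 2]
    print(challenging_classes(preds, labels, threshold=0.7))  # -> [1, 2]
```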