🤖 AI Summary
This study examines how visual information included in training affects the word representations learned by speech-based and text-based language encoders. Methodologically, it combines global representational similarity comparisons with targeted clustering analyses to assess whether visual grounding strengthens the encoding of word identity or of word meaning. The results show that visual grounding increases alignment between representations of spoken and written language, but this effect is driven mainly by improved encoding of word identity rather than semantics. In speech encoders, representations remain phonetically dominated, and, unlike in text encoders, visual grounding does not improve semantic discriminability. These findings highlight a limitation of current visually grounded speech representation learning and can inform the design of more efficient, semantics-oriented methods for injecting visual information into speech-based models.
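As a rough illustration of the kind of global representational comparison described above, the sketch below computes linear Centered Kernel Alignment (CKA) between word-level representations from a speech encoder and a text encoder. This is only a minimal sketch under assumptions: the choice of CKA, the function name, and the variables `speech_reps` and `text_reps` are illustrative and need not match the paper's exact analysis.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two representation matrices.

    x, y: arrays of shape (n_words, d_x) and (n_words, d_y), with rows aligned
    so that row i holds both encoders' representations of the same word.
    Returns a similarity in [0, 1]; higher means more aligned representations.
    """
    # Center each representation space around its mean.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Frobenius norm of the cross-covariance, normalised by the self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical usage: compare speech-text alignment with and without grounding.
# speech_reps, text_reps = ...  # (n_words, d) matrices from the two encoders
# alignment = linear_cka(speech_reps, text_reps)
```

Comparing such a score for visually grounded versus ungrounded encoder pairs is one straightforward way to quantify the cross-modal alignment effect the summary refers to.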
📝 Abstract
How does visual information included in training affect language processing in audio- and text-based deep learning models? We explore how such visual grounding affects model-internal representations of words, and find substantially different effects in speech- vs. text-based language encoders. Firstly, global representational comparisons reveal that visual grounding increases alignment between representations of spoken and written language, but this effect seems mainly driven by enhanced encoding of word identity rather than meaning. We then apply targeted clustering analyses to probe for phonetic vs. semantic discriminability in model representations. Speech-based representations remain phonetically dominated even with visual grounding, and, in contrast to text-based representations, their semantic discriminability does not improve. Our findings could usefully inform the development of more efficient methods to enrich speech-based models with visually-informed semantics.
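The targeted clustering analyses mentioned in the abstract can be approximated by scoring how well representations separate under different labellings, for example word identity versus semantic category. The sketch below uses a silhouette score for this purpose; the metric choice, function name, and the hypothetical variables `reps`, `word_ids`, and `sem_cats` are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def class_discriminability(reps: np.ndarray, labels: np.ndarray) -> float:
    """Mean silhouette score of representations under a given labelling.

    Higher values mean items with the same label cluster together and are
    well separated from items with other labels.
    """
    return float(silhouette_score(reps, labels, metric="cosine"))


# Hypothetical usage: probe the same representations with two labellings.
# reps = ...      # (n_tokens, d) word-level representations from one encoder layer
# word_ids = ...  # word-identity labels (phonetic/orthographic form)
# sem_cats = ...  # semantic-category labels
# identity_score = class_discriminability(reps, word_ids)
# semantic_score = class_discriminability(reps, sem_cats)
```

Contrasting the identity-based and semantics-based scores, with and without visual grounding, is one way to operationalise the phonetic vs. semantic discriminability comparison described above.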